Hi,
Could someone take a look at KAFKA-4360 and join the discussion there? Thanks.
> On Nov 1, 2016, at 10:54 AM, huxi (JIRA) <j...@apache.org> wrote:
>
>
> [https://issues.apache.org/jira/browse/KAFKA-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624155#comment-15624155]
>
> huxi commented on KAFKA-4360:
> -----------------------------
>
> Excellent analysis! What intrigues me is whether this is a deadlock or a
> liveness issue. Here is my analysis:
> 1. Say that at time T1 the ZooKeeper session expires, so the
> 'handleNewSession' method of SessionExpirationListener is executed, which
> acquires the controller lock (controllerContext.controllerLock).
> 2. It then invokes 'onControllerResignation' to make the current controller
> resign, which shuts down the leader-rebalance scheduler by calling
> KafkaScheduler.shutdown.
> 3. The 'shutdown' method shuts down the ScheduledThreadPoolExecutor and then
> blocks until all tasks have completed execution after the shutdown request.
> 4. If any task was submitted before shutdown was called, the check-imbalance
> thread starts by checking isActive, which acquires the controller lock at the
> very beginning, and is soon blocked because the lock is already held by the
> thread running handleNewSession.
> 5. In that case, the thread running handleNewSession blocks in
> onControllerResignation until one day (the default wait) has elapsed, or until
> you interrupt the check thread.
>
> Does it make sense?
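A minimal Java sketch of the hang in steps 1-5, assuming the real code follows the pattern described above: one thread holds a ReentrantLock (a hypothetical stand-in for controllerContext.controllerLock), a task already running on a ScheduledThreadPoolExecutor needs that same lock, and the lock holder calls shutdown() followed by awaitTermination(). The class and method names (ResignationHang, resignWhileCheckIsRunning) are illustrative only, not Kafka code, and a short millisecond timeout replaces the one-day wait mentioned above so the example finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ResignationHang {
    // Hypothetical stand-in for controllerContext.controllerLock.
    static final ReentrantLock controllerLock = new ReentrantLock();

    /**
     * Models handleNewSession -> onControllerResignation -> scheduler shutdown.
     * Returns true if the scheduler did not terminate within waitMillis, i.e. the
     * resigning thread was stuck waiting for a task that needs the lock it holds.
     */
    static boolean resignWhileCheckIsRunning(long waitMillis) throws InterruptedException {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        CountDownLatch checkStarted = new CountDownLatch(1);

        // Step 1: the session-expiration thread takes the controller lock.
        controllerLock.lock();
        boolean timedOut;
        try {
            // Step 4: a check-imbalance run is in flight and needs the same lock.
            scheduler.execute(() -> {
                checkStarted.countDown();
                controllerLock.lock(); // blocks: the resigning thread holds the lock
                try {
                    // the imbalance check itself would run here
                } finally {
                    controllerLock.unlock();
                }
            });
            checkStarted.await();

            // Steps 2-3: resignation shuts the scheduler down and waits for running tasks.
            scheduler.shutdown();
            timedOut = !scheduler.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Release the lock so the blocked task can finish and the sketch can clean up.
            controllerLock.unlock();
        }
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
        return timedOut;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean hung = resignWhileCheckIsRunning(500);
        System.out.println(hung ? "resignation blocked until timeout" : "scheduler terminated");
    }
}
```

Under these assumptions the stall is bounded by the awaitTermination timeout rather than permanent: the resigning thread only gets past shutdown once the wait expires, which is the deadlock-versus-liveness distinction raised in the comment.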
>
>
>> Controller may deadLock when autoLeaderRebalance encounter zk expired
>> ---------------------------------------------------------------------
>>
>> Key: KAFKA-4360
>> URL: https://issues.apache.org/jira/browse/KAFKA-4360
>> Project: Kafka
>> Issue Type: Bug
>> Components: controller
>> Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>> Reporter: Json Tu
>> Labels: bugfix
>> Attachments: yf-mafka2-common02_jstack.txt
>>
>> Original Estimate: 168h
>> Remaining Estimate: 168h
>>
>> When the controller has a checkAndTriggerPartitionRebalance task in
>> autoRebalanceScheduler and the ZK session expires at that moment, the
>> controller runs into a deadlock.
>> The scenario can be reconstructed as follows: when the ZK session expires,
>> the ZK thread calls handleNewSession, defined in SessionExpirationListener,
>> which acquires controllerContext.controllerLock and then calls
>> autoRebalanceScheduler.shutdown(). That call must wait for all tasks in
>> autoRebalanceScheduler to complete, but the thread pool's task also needs
>> controllerContext.controllerLock, which is already held by the ZK callback
>> thread, so the two threads deadlock.
>> This causes at least two problems. First, the broker's id cannot be
>> registered in ZooKeeper, so the new controller considers the broker dead.
>> Second, the process cannot be stopped with kafka-server-stop.sh, because the
>> shutdown function cannot acquire controllerContext.controllerLock either; the
>> only way to stop Kafka is kill -9.
>> I have attached a jstack dump, taken when my Kafka process could not be shut
>> down with kafka-server-stop.sh.
>> I have hit this scenario several times; I think it may be a bug that has not
>> yet been fixed in Kafka.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)