[ https://issues.apache.org/jira/browse/KAFKA-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555148#comment-13555148 ]

Jun Rao commented on KAFKA-702:
-------------------------------

First of all, +1 on the simple patch. I think it solves the immediate problem.

For the socket selector, my understanding after reading the Java doc is that 
selected keys stay in the selected-key set until you explicitly remove them. In 
other words, those keys do not go away on their own once the socket's data has 
been consumed. Every time we call select(), only newly ready keys are added; 
keys already in the selected set are left untouched. So even after we have 
finished reading from a socket, if its key is not explicitly removed, select() 
will keep returning the same key. That said, my suggestion is probably worse 
than this patch: until the request queue has space again, the processor thread 
could busy-loop, repeatedly trying to add the same request from that socket to 
the request queue.
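
As an illustration of that selector behavior, here is a minimal, generic NIO 
read loop (a sketch only; the class and method names are made up, and this is 
not Kafka's SocketServer code):

    // Minimal sketch of java.nio selector semantics: the selector never clears
    // the selected-key set itself, so the loop must remove keys it has handled.
    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;

    public class SelectorLoopSketch {
        public static void pollLoop(Selector selector) throws IOException {
            while (true) {
                selector.select(300);
                Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
                while (iter.hasNext()) {
                    SelectionKey key = iter.next();
                    // If this remove() is skipped, every later select() call
                    // hands back the same key, even after the read has finished.
                    iter.remove();
                    if (key.isValid() && key.isReadable()) {
                        // read from key.channel() ...
                    }
                }
            }
        }
    }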

A second point is that, currently, the number of outstanding requests on the 
broker is bounded by the number of clients, since each client can have at most 
one outstanding request. So if we bound the number of clients, we also roughly 
bound the memory used by outstanding requests. Such a limit is probably useful 
for avoiding running out of open file handles as well.
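
As a rough sketch of what bounding the number of clients could look like at the 
acceptor (hypothetical code; MAX_CONNECTIONS and the class name are invented, 
and this is not the current SocketServer):

    // Bound concurrent client connections with a semaphore around accept().
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.Semaphore;

    public class BoundedAcceptorSketch {
        private static final int MAX_CONNECTIONS = 1000; // illustrative limit
        private final Semaphore permits = new Semaphore(MAX_CONNECTIONS);

        public void acceptLoop(int port) throws IOException, InterruptedException {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(port));
            while (true) {
                permits.acquire();                 // blocks once the limit is hit
                SocketChannel client = server.accept();
                // hand `client` off to a processor; the processor must call
                // permits.release() when the connection is closed
            }
        }
    }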
                
> Deadlock between request handler/processor threads
> --------------------------------------------------
>
>                 Key: KAFKA-702
>                 URL: https://issues.apache.org/jira/browse/KAFKA-702
>             Project: Kafka
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 0.8
>            Reporter: Joel Koshy
>            Assignee: Jay Kreps
>            Priority: Blocker
>              Labels: bugs
>             Fix For: 0.8
>
>         Attachments: KAFKA-702-v1.patch
>
>
> We have seen this a couple of times in the past few days in a test cluster. 
> The request handler and processor threads deadlock on the request/response 
> queues, bringing the server to a halt.
> "kafka-processor-10251-7" prio=10 tid=0x00007f4a0c3c9800 nid=0x4c39 waiting 
> on condition [0x00007f46f698e000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f48c9dd2698> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at 
> java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:252)
>         at kafka.network.RequestChannel.sendRequest(RequestChannel.scala:107)
>         at kafka.network.Processor.read(SocketServer.scala:321)
>         at kafka.network.Processor.run(SocketServer.scala:231)
>         at java.lang.Thread.run(Thread.java:619)
> "kafka-request-handler-7" daemon prio=10 tid=0x00007f4a0c57f000 nid=0x4c47 
> waiting on condition [0x00007f46f5b80000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f48c9dd6348> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at 
> java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:252)
>         at kafka.network.RequestChannel.sendResponse(RequestChannel.scala:112)
>         at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:198)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:58)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:41)
>         at java.lang.Thread.run(Thread.java:619)
> This is because there is a cycle in the wait-for graph of the processor and 
> request handler threads. If request handling slows down on a busy server, the 
> request queue fills up. All processor threads quickly block on adding incoming 
> requests to the request queue. Because of this, those threads stop processing 
> responses, and their response queues fill up. At that point, the request 
> handler threads start blocking on adding responses to their respective 
> response queues. The result is a deadlock in which every thread is blocked on 
> one full queue while being the only thread that could drain the other. This 
> brings the server to a halt: it still accepts connections, but every request 
> times out.
> One way to resolve this is to break the cycle in the wait-for graph of the 
> request handler and processor threads. Instead of having the processor 
> threads dispatch the responses, we could have one or more dedicated response 
> handler threads that dequeue responses from the queue and write them to the 
> socket. One downside of this approach is that access to the selector would 
> then have to be synchronized.
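
The cycle described above can be reproduced in miniature with two bounded 
queues and the same division of labor. The following is a toy model only (the 
class name, queue sizes, and message strings are invented; this is not the 
actual RequestChannel/SocketServer code); run it and both threads park in 
ArrayBlockingQueue.put(), matching the stack traces in this ticket:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class QueueCycleSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> requestQueue  = new ArrayBlockingQueue<>(1);
            BlockingQueue<String> responseQueue = new ArrayBlockingQueue<>(1);

            // An earlier response that the processor has not written back yet.
            responseQueue.put("stale-response");

            Thread processor = new Thread(() -> {
                try {
                    // Enqueue the requests read in one pass before draining
                    // responses. The third put() can never complete, so the
                    // drain below is never reached.
                    for (int i = 0; i < 3; i++) {
                        requestQueue.put("request-" + i);
                    }
                    while (responseQueue.poll() != null) { /* write to socket */ }
                } catch (InterruptedException ignored) { }
            }, "processor");

            Thread handler = new Thread(() -> {
                try {
                    String req = requestQueue.take();
                    // Blocks forever: the response queue is full and its only
                    // consumer, the processor, is itself blocked above.
                    responseQueue.put("response-to-" + req);
                } catch (InterruptedException ignored) { }
            }, "handler");

            processor.start();
            handler.start();

            Thread.sleep(2000);
            System.out.println("processor: " + processor.getState()); // WAITING
            System.out.println("handler:   " + handler.getState());  // WAITING
        }
    }

Breaking the cycle as suggested above, with dedicated threads writing responses 
back to the sockets, removes the dependency of the response-queue drain on the 
processor getting past its request-queue put.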

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
