[ https://issues.apache.org/jira/browse/STORM-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564539#comment-14564539 ]
ASF GitHub Bot commented on STORM-839:
--------------------------------------
GitHub user eshioji opened a pull request:
https://github.com/apache/storm/pull/566
[STORM-839] Deadlock hazard in backtype.storm.messaging.netty.Client
(I accidentally did a PR against master with the same content, please
ignore that one)
This fixes the reported deadlock between the `disruptor-worker-transfer-queue`
thread and the `client-worker` thread, which seems to have been introduced by
STORM-329.
After reviewing the `v0.9.4` code, my conclusion is that the background
flushing task can be removed without any real change in behavior. Doing so
allows the synchronization that was involved in the deadlock to be removed
altogether.
My reasoning is as follows:
- One has three options for dealing with Netty's buffer filling up:
(i) Discard incoming new messages
(ii) Block the client thread until there is space (back pressure)
(iii) Keep buffering until an OOME is thrown
- My guess is that the `v0.9.4` code attempted to implement option (i),
but the actual behavior is option (iii). When Netty's `Channel.isWritable`
returns false in `Client.send`, the thread avoids writing to the `Channel` and
instead leaves the messages in the `Client.messageBatch` field, which the
background task flushes once `Channel.isWritable` starts to return true again.
However, if `Client.send` is called again before that happens (which should be
common), the content of `Client.messageBatch` is written and buffered on the
`Channel` anyway, because `flushMessages` does not check `Channel.isWritable`
(see the sketch after this list).
- AFAIU we would prefer option (ii), but it requires more work. Between
option (i) and option (iii), IMO (iii) is superior, because if an OOME happens
the user can reduce MAX_PENDING_TUPLES to prevent it, whereas discarded
messages would be harder to diagnose.
- If we are content with option (iii), we can remove the background
flushing task and the related synchronization altogether, eliminating the
source of the deadlock. We'd simply write and buffer onto the `Channel`'s
unbounded buffer; if too many messages pile up, the worker will die of an
OOME, which the user should mitigate through indirect means such as reducing
MAX_PENDING_TUPLES until option (ii) is implemented in the future.
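To make the second bullet concrete, here is a minimal sketch of the flow it describes. This is not the actual `Client.java`; `ChannelStub`, `ClientSketch`, and `BATCH_SIZE` are hypothetical simplifications, with `ChannelStub` standing in for Netty's `Channel`:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Stand-in for Netty's Channel: write() never blocks, it queues on an
// unbounded buffer; isWritable() only reports whether that buffer has
// crossed a high-water mark.
final class ChannelStub {
    final Queue<List<String>> unboundedBuffer = new ArrayDeque<>();
    private volatile boolean writable = true;
    boolean isWritable() { return writable; }
    void setWritable(boolean w) { writable = w; } // flipped by the I/O layer
    void write(List<String> batch) { unboundedBuffer.add(batch); }
}

final class ClientSketch {
    private static final int BATCH_SIZE = 512; // illustrative only
    private final ChannelStub channel = new ChannelStub();
    private List<String> messageBatch = new ArrayList<>(); // flushed by the background task

    synchronized void send(List<String> msgs) {
        for (String m : msgs) {
            messageBatch.add(m);
            if (messageBatch.size() >= BATCH_SIZE) {
                flushMessages(); // full batches bypass the isWritable() guard
            }
        }
        if (!messageBatch.isEmpty() && channel.isWritable()) {
            flushMessages(); // only partial batches are guarded...
        }
        // ...otherwise messageBatch is left for the background flusher. But
        // the next send() call refills the batch and reaches the unconditional
        // flush above, so messages land on the Channel's unbounded buffer
        // anyway: option (iii) in practice.
    }

    private synchronized void flushMessages() {
        channel.write(messageBatch); // no isWritable() check, as noted above
        messageBatch = new ArrayList<>();
    }
}
```

Since the guard is ineffective anyway, removing the background flusher and the coordination around `messageBatch` keeps the same effective behavior while eliminating one side of the lock cycle.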
Thoughts @miguno ?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/eshioji/storm STORM-839
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/566.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #566
----
commit ed8ab3ec194f19c75fc2f5c000609204f04b50e8
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T19:42:05Z
Simplified the flow and removed the lock that was causing the deadlock
commit 91b8eb3840432e47b79f40abebec8304627732a8
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T19:46:17Z
Bump version
commit b7d84bdc7fd3de34f45a94131cdbb6bfbd3763dc
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T21:27:31Z
Remove background flushing because it doesn't seem necessary. Netty's
Channel queues up written data on an unbounded buffer. The background flushing
seems to have been added to avoid this, but in practice it was probably doing
it anyway, because flushMessages(), which is called by send(), doesn't check
isWritable. Moreover, queuing on an unbounded buffer seems fine because back
pressure is provided by MAX_PENDING_TUPLES. If an OOME occurs due to this
buffer overflowing, it seems reasonable that one has to reduce
MAX_PENDING_TUPLES, rather than Storm trying to cope with it by dropping messages.
commit 679e42bc1e38f51c2759667b03cb45322c6a793b
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T21:31:35Z
Change to a SNAPSHOT version for deployment purposes
commit 27a92e2aa3488c0203f500306e0583ff9e7e1e82
Author: Enno Shioji <[email protected]>
Date: 2015-05-29T09:32:16Z
Remove (now) dead comment and code
commit 09bf6e1b5d9d351f2a60cd9a32e0239752cf437a
Author: Enno Shioji <[email protected]>
Date: 2015-05-29T10:23:46Z
Merge branch '0.9.x-branch' into STORM-839
Conflicts:
examples/storm-starter/pom.xml
external/storm-hbase/pom.xml
external/storm-hdfs/pom.xml
external/storm-kafka/pom.xml
pom.xml
storm-buildtools/maven-shade-clojure-transformer/pom.xml
storm-core/pom.xml
storm-dist/binary/pom.xml
storm-dist/source/pom.xml
----
> Deadlock hazard in backtype.storm.messaging.netty.Client
> --------------------------------------------------------
>
> Key: STORM-839
> URL: https://issues.apache.org/jira/browse/STORM-839
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.4
> Reporter: Enno Shioji
> Priority: Critical
>
> See the thread dump below that shows the deadlock. client-worker-1 is holding
> 7b5a7fa5 and waiting on 1446a1e9. Thread-10-disruptor-worker-transfer-queue
> is holding 1446a1e9 and is waiting on 7b5a7fa5.
> (Thread dump is truncated to show only the relevant parts)
> 2015-05-28 15:37:15
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.72-b04 mixed mode):
> "Thread-10-disruptor-worker-transfer-queue" - Thread t@52
> java.lang.Thread.State: BLOCKED
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:398)
> - waiting to lock <7b5a7fa5> (a java.lang.Object) owned by
> "client-worker-1" t@25
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
> at
> org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
> at
> org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
> at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
> at
> org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
> at
> org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
> at
> org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
> at
> org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
> at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
> at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
> at
> org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
> at backtype.storm.messaging.netty.Client.flushMessages(Client.java:480)
> - locked <1446a1e9> (a backtype.storm.messaging.netty.Client)
> at backtype.storm.messaging.netty.Client.send(Client.java:412)
> - locked <1446a1e9> (a backtype.storm.messaging.netty.Client)
> at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
> at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5014$fn__5015.invoke(worker.clj:334)
> at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5014.invoke(worker.clj:332)
> at
> backtype.storm.disruptor$clojure_handler$reify__1446.onEvent(disruptor.clj:58)
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
> at
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
> at
> backtype.storm.disruptor$consume_loop_STAR_$fn__1459.invoke(disruptor.clj:94)
> at backtype.storm.util$async_loop$fn__458.invoke(util.clj:463)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Unknown Source)
> Locked ownable synchronizers:
> - None
> "client-worker-1" - Thread t@25
> java.lang.Thread.State: BLOCKED
> at
> backtype.storm.messaging.netty.Client.closeChannelAndReconnect(Client.java:501)
> - waiting to lock <1446a1e9> (a backtype.storm.messaging.netty.Client)
> owned by "Thread-10-disruptor-worker-transfer-queue" t@52
> at backtype.storm.messaging.netty.Client.access$1400(Client.java:78)
> at
> backtype.storm.messaging.netty.Client$3.operationComplete(Client.java:492)
> at
> org.apache.storm.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:427)
> at
> org.apache.storm.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:413)
> at
> org.apache.storm.netty.channel.DefaultChannelFuture.setFailure(DefaultChannelFuture.java:380)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:437)
> - locked <7b5a7fa5> (a java.lang.Object)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
> at
> org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> at
> org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> at
> org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> at
> org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Locked ownable synchronizers:
> - locked <75e528fd> (a java.util.concurrent.ThreadPoolExecutor$Worker)