[
https://issues.apache.org/jira/browse/STORM-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564520#comment-14564520
]
ASF GitHub Bot commented on STORM-839:
--------------------------------------
GitHub user eshioji opened a pull request:
https://github.com/apache/storm/pull/565
STORM-839
This fixes the reported deadlock between `disruptor-worker-transfer-queue`
thread and `client-worker` thread, which seem to have been introduced by
STORM-329.
After reviewing the `v0.9.4` code, my conclusion was that the background
flushing task can be removed without real change in the behavior. By doing
this, the synchronization that was involved in the deadlock can be removed
altogether.
My reasoning is as follows:
- One has three option to deal with Netty's buffer filling up:
1. Discard incoming new messages
2. Block client thread until there is space (back pressure)
3. Keep buffering up until OOME is thrown
- My guess is that the `v0.9.4` code attempted to implement option (i),
but actually the behavior is option (iii). When Netty's `Channel.isWritable`
returns false in `Client.send`, the thread avoids writing to the `Channel` and
leaves messages in the `Channel.messageBatch` field, which the background task
flushes when the `Channel.isWritable` starts to return true again. However, if
`Client.send` is called again before this (which should be common), the content
of `Channel.messageBatch` is written and buffered on the `Channel` anyways,
because `flushMessages` does not check `Channel.isWritable`.
- AFAIU we like option (ii), but this requires more work. Between option
(i) and option (iii), IMO (iii) is superior because if OOME happens, the user
can reduce MAX_PENDING_TUPLES to prevent this. Discarding messages would be
harder to diagnose.
- If we are content with option (ii), we can remove the background
flushing task and related synchronization altogether, removing the source of
deadlock. We'd simply write and buffer onto the unbounded buffer of `Channel`,
and if there are too many pending messages, the worker will die of OOME, and
the user should reduce this with indirect means like reducing
MAX_PENDING_TUPLES until option (ii) is implemented in the future.
Thoughts @miguno ?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/eshioji/storm STORM-839
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/565.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #565
----
commit 98f4e619d54052b73a309d23ab7214953e4c7774
Author: Sriharsha Chintalapani <[email protected]>
Date: 2015-02-05T18:08:05Z
STORM-130: Supervisor getting killed due to java.io.FileNotFoundException:
File '../stormconf.ser' does not exist.
Signed-off-by: P. Taylor Goetz <[email protected]>
commit a1e5893e1b94c224d39fedf11583b216c21351c8
Author: P. Taylor Goetz <[email protected]>
Date: 2015-02-24T20:46:12Z
update changelog for STORM-130
commit 62788f295bb1fb1cc83b99c30f82beb40eea5f25
Author: P. Taylor Goetz <[email protected]>
Date: 2015-02-24T23:03:40Z
port STORM-329 fix to 0.9.x
commit 81016c2ed7222da99138bc9971e335533d4cb518
Author: Michael G. Noll <[email protected]>
Date: 2015-02-16T09:01:27Z
Track how many messages are being dropped when a connection is unavailable
Signed-off-by: P. Taylor Goetz <[email protected]>
commit 97a76fc896de508f015dbe32f1473ddbf10d736b
Author: Michael G. Noll <[email protected]>
Date: 2015-02-16T09:03:07Z
Clarify name of method for dropping messages
Signed-off-by: P. Taylor Goetz <[email protected]>
commit 9138d9fc255639b4d0d43657379ce467591e8ef2
Author: Michael G. Noll <[email protected]>
Date: 2015-02-16T09:07:35Z
Change log level for intentionally dropping messages from WARN to ERROR
This change makes the log level for dropping messages consistent in
Client.java.
Signed-off-by: P. Taylor Goetz <[email protected]>
commit 6b06d8468ff5e743fb12b85dd84fe0931041c2c3
Author: P. Taylor Goetz <[email protected]>
Date: 2015-02-24T23:18:43Z
add STORM-329 to changelog
commit e63fb2af9086e2b2e688662ca42a4b4d0112274b
Author: Parth Brahmbhatt <[email protected]>
Date: 2015-03-03T00:06:58Z
STORM-693: when bolt fails to write tuple, it should report error instead
of silently acking.
Signed-off-by: P. Taylor Goetz <[email protected]>
commit 92836de540ec8ab90d7591b96ba02126e80b5c3a
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T14:59:56Z
add STORM-693 to changelog
commit c19e482b70f18d690ad165c78551860506486095
Author: Parth Brahmbhatt <[email protected]>
Date: 2015-02-20T19:56:22Z
STORM-682: supervisor should handle worker state corruption gracefully.
Signed-off-by: P. Taylor Goetz <[email protected]>
commit f0de11a20fe2f20dc1dc2f485549e0dc342f8680
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T15:05:30Z
add STORM-682 to changelog
commit 835a410c879dc1eb02d9670410f65fe0be6f28c6
Author: Parth Brahmbhatt <[email protected]>
Date: 2015-01-14T20:27:35Z
STORM-559: ZkHosts in README should use 2181 as port.
Signed-off-by: P. Taylor Goetz <[email protected]>
commit 30e0be8616c89cb1f8a51fcf462f76a075e6e964
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T15:11:16Z
add STORM-559 to changelog
commit b1bbacb7134d17ff47c2e8b8857a66244a4d1d4f
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T15:28:11Z
add missing import in supervisor.clj
commit edf596bac8feab0c8721f7de94474e5549858355
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T16:21:38Z
[maven-release-plugin] prepare release v0.9.4
commit 48d10e20eb3c750fc41fcf0bef3d49501cf6d5a4
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T16:21:45Z
[maven-release-plugin] prepare for next development iteration
commit 41f44f9914d4f27d0db3f211a85f88301533f09b
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T17:59:39Z
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit 48d10e20eb3c750fc41fcf0bef3d49501cf6d5a4.
commit 233603c3cbd729fdfabd2759bfa7705811996aa4
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T18:00:11Z
Revert "[maven-release-plugin] prepare release v0.9.4"
This reverts commit edf596bac8feab0c8721f7de94474e5549858355.
commit 61e1b5c3e226122143c91bbd7527b605f8ed8727
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T18:05:51Z
add Apache license header to ConnectionWithStatus.java
commit 00091d7952681a39281aa171adfad133a5e26330
Author: P. Taylor Goetz <[email protected]>
Date: 2015-03-18T18:13:57Z
[maven-release-plugin] prepare release v0.9.4
commit ed8ab3ec194f19c75fc2f5c000609204f04b50e8
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T19:42:05Z
Simplified the flow and removed the lock that was causing the deadlock
commit 91b8eb3840432e47b79f40abebec8304627732a8
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T19:46:17Z
Bump version
commit b7d84bdc7fd3de34f45a94131cdbb6bfbd3763dc
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T21:27:31Z
Remove background flushing because it doesn't seem necessary. Netty's
Channel queues up written data on an unbounded buffer. The background flushing
seems to have been added to avoid this, but in practice it was probably doing
it anyways because flushMessages(), which is called by send() doesn't check for
isWritable. Moreover, queuing on an unbounded buffer seems fine because back
pressure is provided by MAX_PENDING_TUPLE. If OOME occurs due to this buffer
overflowing, it seems reasonable that one has to reduce MAX_PENDING_TUPLE,
rather than Storm trying to cope with it by dropping messages.
commit 679e42bc1e38f51c2759667b03cb45322c6a793b
Author: Enno Shioji <[email protected]>
Date: 2015-05-28T21:31:35Z
Change to a SNAPSHOT version for deployment purposes
commit 27a92e2aa3488c0203f500306e0583ff9e7e1e82
Author: Enno Shioji <[email protected]>
Date: 2015-05-29T09:32:16Z
Remove (now) dead comment and code
----
> Deadlock hazard in backtype.storm.messaging.netty.Client
> --------------------------------------------------------
>
> Key: STORM-839
> URL: https://issues.apache.org/jira/browse/STORM-839
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.4
> Reporter: Enno Shioji
> Priority: Critical
>
> See the thread dump below that shows the deadlock. client-worker-1 is holding
> 7b5a7fa5 and waiting on 1446a1e9. Thread-10-disruptor-worker-transfer-queue
> is holding 1446a1e9 and is waiting on 7b5a7fa5.
> (Thread dump is truncated to show only the relevant parts)
> 2015-05-28 15:37:15
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.72-b04 mixed mode):
> "Thread-10-disruptor-worker-transfer-queue" - Thread t@52
> java.lang.Thread.State: BLOCKED
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:398)
> - waiting to lock <7b5a7fa5> (a java.lang.Object) owned by
> "client-worker-1" t@25
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
> at
> org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
> at
> org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
> at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
> at
> org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
> at
> org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
> at
> org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
> at
> org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
> at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
> at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
> at
> org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
> at backtype.storm.messaging.netty.Client.flushMessages(Client.java:480)
> - locked <1446a1e9> (a backtype.storm.messaging.netty.Client)
> at backtype.storm.messaging.netty.Client.send(Client.java:412)
> - locked <1446a1e9> (a backtype.storm.messaging.netty.Client)
> at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
> at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5014$fn__5015.invoke(worker.clj:334)
> at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5014.invoke(worker.clj:332)
> at
> backtype.storm.disruptor$clojure_handler$reify__1446.onEvent(disruptor.clj:58)
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
> at
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
> at
> backtype.storm.disruptor$consume_loop_STAR_$fn__1459.invoke(disruptor.clj:94)
> at backtype.storm.util$async_loop$fn__458.invoke(util.clj:463)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Unknown Source)
> Locked ownable synchronizers:
> - None
> "client-worker-1" - Thread t@25
> java.lang.Thread.State: BLOCKED
> at
> backtype.storm.messaging.netty.Client.closeChannelAndReconnect(Client.java:501)
> - waiting to lock <1446a1e9> (a backtype.storm.messaging.netty.Client)
> owned by "Thread-10-disruptor-worker-transfer-queue" t@52
> at backtype.storm.messaging.netty.Client.access$1400(Client.java:78)
> at
> backtype.storm.messaging.netty.Client$3.operationComplete(Client.java:492)
> at
> org.apache.storm.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:427)
> at
> org.apache.storm.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:413)
> at
> org.apache.storm.netty.channel.DefaultChannelFuture.setFailure(DefaultChannelFuture.java:380)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:437)
> - locked <7b5a7fa5> (a java.lang.Object)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
> at
> org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
> at
> org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> at
> org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> at
> org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> at
> org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Locked ownable synchronizers:
> - locked <75e528fd> (a java.util.concurrent.ThreadPoolExecutor$Worker)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)