[
https://issues.apache.org/jira/browse/STORM-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14722708#comment-14722708
]
Derek Dagit commented on STORM-996:
-----------------------------------
If I enable the debugging statement that logs how many messages are sent when
sending a batch, I can no longer reproduce the error. As soon as I disable
this statement, the error reappears.
So I thought I would quickly check the actual network traffic. I confirmed by
inspecting packets sent during the netty-unit-test/test-batch that the task
messages were actually sent out-of-order, and it is not a matter of the test
code incorrectly handling what had been received in-order.
Patch:
{code}
diff --git a/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
b/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
index 5d27a16..0b59ec6 100644
--- a/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
+++ b/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
@@ -18,6 +18,7 @@
package backtype.storm.messaging.netty;
import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory;
+import org.jboss.netty.handler.execution.OrderedDownstreamThreadPoolExecutor;
import org.jboss.netty.util.HashedWheelTimer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -56,11 +57,10 @@ public class Context implements IContext {
ThreadFactory bossFactory = new
NettyRenameThreadFactory("client" + "-boss");
ThreadFactory workerFactory = new NettyRenameThreadFactory("client" +
"-worker");
if (maxWorkers > 0) {
- clientChannelFactory = new
NioClientSocketChannelFactory(Executors.newCachedThreadPool(bossFactory),
- Executors.newCachedThreadPool(workerFactory), maxWorkers);
+ clientChannelFactory = new NioClientSocketChannelFactory(new
OrderedDownstreamThreadPoolExecutor(4, 30, TimeUnit.SECONDS, bossFactory),
+ new OrderedDownstreamThreadPoolExecutor(4, 30,
TimeUnit.SECONDS, workerFactory), maxWorkers);
} else {
- clientChannelFactory = new
NioClientSocketChannelFactory(Executors.newCachedThreadPool(bossFactory),
- Executors.newCachedThreadPool(workerFactory));
+ throw new
RuntimeException(Config.STORM_MESSAGING_NETTY_CLIENT_WORKER_THREADS + " should
be greater than 0");
}
clientScheduleService = new HashedWheelTimer(new
NettyRenameThreadFactory("client-schedule-service"));
{code}
Here are three messages received in sequence showing out-of-order sending from
the network perspective:
{code}
ASCII-encoded hex values of task message
|
payload length (long) |
| |
taskId (short) | |
| | |
<...> <.........> <............>
00:01:00:00:00:05:36:32:32:32:34:ff:37
00:01:00:00:00:05:36:32:32:32:36:ff:37
00:01:00:00:00:05:36:32:32:32:35:ff:37
{code}
Messages decode to "62224", "62226", "62225", and this corresponds with the
test output:
{code}
expected: (= req_msg resp_msg)
actual: (not (= "62225" "62226"))
at: test_runner.clj:105
expected: (= req_msg resp_msg)
actual: (not (= "62226" "62225"))
at: test_runner.clj:105
{code}
It does not happen every time, and the specific failures vary, but at least
this confirms that it is not an issue with the test code, and that the client
actually sent things out-of-order even when Netty's executor ensures ordering
of downstream events.
> netty-unit-tests/test-batch demonstrates out-of-order delivery
> --------------------------------------------------------------
>
> Key: STORM-996
> URL: https://issues.apache.org/jira/browse/STORM-996
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.10.0
> Reporter: Derek Dagit
> Assignee: Derek Dagit
> Priority: Blocker
> Attachments:
> storm-996-OrderedDownstreamThreadPoolExecutor-failed.txt.bz2
>
>
> backtype.storm.messaging.netty-unit-test/test-batch
> One example of output. Similar things happen sporadically and vary widely by
> number of failed assertions.
> Tuples are not just skewed, but actually seem to come in out-of-order.
> {quote}
> actual: (not (= "66040" "66041"))
> at: test_runner.clj:105
> expected: (= req_msg resp_msg)
> actual: (not (= "66041" "66042"))
> at: test_runner.clj:105
> expected: (= req_msg resp_msg)
> actual: (not (= "66042" "66040"))
> at: test_runner.clj:105
> {quote}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)