[ 
https://issues.apache.org/jira/browse/STORM-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14722708#comment-14722708
 ] 

Derek Dagit commented on STORM-996:
-----------------------------------

If I enable the debugging statement that logs how many messages are sent when 
sending a batch, I can no longer reproduce the error.  As soon as I disable 
this statement, the error reappears.

So I thought I would quickly check the actual network traffic.  I confirmed by 
inspecting packets sent during the netty-unit-test/test-batch that the task 
messages were actually sent out-of-order, and it is not a matter of the test 
code incorrectly handling what had been received in-order.

Patch:

{code}
diff --git a/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java 
b/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
index 5d27a16..0b59ec6 100644
--- a/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
+++ b/storm-core/src/jvm/backtype/storm/messaging/netty/Context.java
@@ -18,6 +18,7 @@
 package backtype.storm.messaging.netty;
 
 import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory;
+import org.jboss.netty.handler.execution.OrderedDownstreamThreadPoolExecutor;
 import org.jboss.netty.util.HashedWheelTimer;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -56,11 +57,10 @@ public class Context implements IContext {
                ThreadFactory bossFactory = new 
NettyRenameThreadFactory("client" + "-boss");
         ThreadFactory workerFactory = new NettyRenameThreadFactory("client" + 
"-worker");
         if (maxWorkers > 0) {
-            clientChannelFactory = new 
NioClientSocketChannelFactory(Executors.newCachedThreadPool(bossFactory),
-                    Executors.newCachedThreadPool(workerFactory), maxWorkers);
+            clientChannelFactory = new NioClientSocketChannelFactory(new 
OrderedDownstreamThreadPoolExecutor(4, 30, TimeUnit.SECONDS, bossFactory),
+                    new OrderedDownstreamThreadPoolExecutor(4, 30, 
TimeUnit.SECONDS, workerFactory), maxWorkers);
         } else {
-            clientChannelFactory = new 
NioClientSocketChannelFactory(Executors.newCachedThreadPool(bossFactory),
-                    Executors.newCachedThreadPool(workerFactory));
+            throw new 
RuntimeException(Config.STORM_MESSAGING_NETTY_CLIENT_WORKER_THREADS + " should 
be greater than 0");
         }
         
         clientScheduleService = new HashedWheelTimer(new 
NettyRenameThreadFactory("client-schedule-service"));
{code}


Here are three messages received in sequence showing out-of-order sending from 
the network perspective:

{code}
ASCII-encoded hex values of task message 
                               |
   payload length (long)       |
                |              |
taskId (short)  |              |
    |           |              |
<...> <.........> <............>
00:01:00:00:00:05:36:32:32:32:34:ff:37
00:01:00:00:00:05:36:32:32:32:36:ff:37
00:01:00:00:00:05:36:32:32:32:35:ff:37
{code}

Messages decode to "62224", "62226", "62225", and this corresponds with the 
test output:


{code}
  expected: (= req_msg resp_msg)
  actual: (not (= "62225" "62226"))
      at: test_runner.clj:105
  expected: (= req_msg resp_msg)
  actual: (not (= "62226" "62225"))
      at: test_runner.clj:105
{code}

It does not happen every time, and the specific failures vary, but at least 
this confirms that it is not an issue with the test code, and that the client 
actually sent things out-of-order even when Netty's executor ensures ordering 
of downstream events.

> netty-unit-tests/test-batch demonstrates out-of-order delivery
> --------------------------------------------------------------
>
>                 Key: STORM-996
>                 URL: https://issues.apache.org/jira/browse/STORM-996
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>            Reporter: Derek Dagit
>            Assignee: Derek Dagit
>            Priority: Blocker
>         Attachments: 
> storm-996-OrderedDownstreamThreadPoolExecutor-failed.txt.bz2
>
>
> backtype.storm.messaging.netty-unit-test/test-batch
> One example of output.  Similar things happen sporadically and vary widely by 
> number of failed assertions.
> Tuples are not just skewed, but actually seem to come in out-of-order.
> {quote}
> actual: (not (= "66040" "66041"))
> at: test_runner.clj:105
> expected: (= req_msg resp_msg)
> actual: (not (= "66041" "66042"))
> at: test_runner.clj:105
> expected: (= req_msg resp_msg)
> actual: (not (= "66042" "66040"))
> at: test_runner.clj:105
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to