Milad Fatenejad created STORM-2108:
--------------------------------------

             Summary: Spout unable to recover after worker crashes...continuously see discarding messages errors...
                 Key: STORM-2108
                 URL: https://issues.apache.org/jira/browse/STORM-2108
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-core
    Affects Versions: 1.0.2
         Environment: Ubuntu 14.04, Java 8, Docker 1.11.2
            Reporter: Milad Fatenejad


Hello:

We have run into a new problem after upgrading to Storm 1.0.2 (from 0.9.2). A 
worker crashed because of a bug in our code that caused a stack overflow 
exception. The supervisor detected the crash and restarted the worker as 
expected.

After the worker crashed, many of the tuples the spout emits continuously time 
out and our throughput slows to a crawl. The spout appears to emit tuples until 
it hits max spout pending; then some of those tuples time out and it emits the 
next batch.
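For reference, the two settings that drive the behavior above are 
topology.max.spout.pending and topology.message.timeout.secs. Here is a minimal 
sketch of how such a topology might be configured and submitted; the spout and 
the values shown are illustrative only, not our actual topology:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Placeholder spout; our real spout and bolts are omitted here.
        builder.setSpout("spout", new TestWordSpout(), 2);

        Config conf = new Config();
        // Cap on un-acked tuples per spout task; once it is reached,
        // nextTuple() is not called again until tuples are acked or failed.
        conf.setMaxSpoutPending(1000);   // illustrative value
        // Tuples not fully acked within this window are failed and replayed.
        conf.setMessageTimeoutSecs(30);  // illustrative value
        conf.setNumWorkers(4);

        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
    }
}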

We saw this exception in the spout log when the worker crashed:

2016-09-21T01:54:32,749 [refresh-connections-timer] [org.apache.storm.messaging.netty.Client] [INFO]> closing Netty Client Netty-Client-/10.103.16.14:31437
2016-09-21T01:54:32,750 [refresh-connections-timer] [org.apache.storm.messaging.netty.Client] [INFO]> waiting up to 600000 ms to send 0 pending messages to Netty-Client-/10.103.16.14:31437
2016-09-21T01:55:35,925 [Netty-server-localhost-31009-worker-1] [org.apache.storm.messaging.netty.StormServerHandler] [ERROR]> server errors in handling the request
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_77]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_77]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_77]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_77]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_77]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.0.2.jar:1.0.2]
        at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.0.2.jar:1.0.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

And now we just continuously see these messages in the spout logs:

2016-09-21T02:03:35,513 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,644 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,774 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,817 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:35,849 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,073 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,141 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,169 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,340 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,365 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,416 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,560 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,607 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,660 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,865 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:36,894 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:37,026 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:37,051 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:37,065 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed
2016-09-21T02:03:37,219 [Thread-10-disruptor-worker-transfer-queue] [org.apache.storm.messaging.netty.Client] [ERROR]> discarding 1 messages because the Netty client to Netty-Client-/10.103.16.14:31437 is being closed

The worker that died (10.103.16.14:31437) was restarted by the supervisor, but 
I don't see any messages in its logs indicating that it is receiving any 
tuples. The "is being closed" messages in the spout logs make me think that 
Storm never finished closing its connection to the crashed worker.
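To illustrate what I suspect is happening (this is a hypothetical sketch of the 
pattern, not Storm's actual org.apache.storm.messaging.netty.Client code): if 
the client for the dead worker is flagged as "closing" but is never actually 
disposed of and replaced with a fresh client for the restarted worker, every 
batch handed to it afterwards takes the closed-state branch and is dropped, 
which would produce exactly the repeating log line above:

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only -- NOT the actual Storm Netty Client.
class StuckClientSketch {
    private final AtomicBoolean closing = new AtomicBoolean(false);
    private final String dstAddress;

    StuckClientSketch(String dstAddress) {
        this.dstAddress = dstAddress;
    }

    void close() {
        // The connection refresher marks the client as closing when the worker dies;
        // it is then supposed to be disposed of and replaced by a fresh client for
        // the restarted worker. If that replacement never happens, we stay here forever.
        closing.set(true);
    }

    void send(List<Object> batch) {
        if (closing.get()) {
            // Matches the repeating log line:
            // "discarding N messages because the Netty client to <addr> is being closed"
            System.err.printf("discarding %d messages because the Netty client to %s is being closed%n",
                    batch.size(), dstAddress);
            return;
        }
        // Normal path: flush the batch over the Netty channel (omitted).
    }
}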

This has happened to us repeatedly since the upgrade. Does anyone have any 
suggestions about how to fix this issue? Also, I originally thought it might be 
related to STORM-1560, but I don't see the exception that is mentioned in that 
ticket.

Thanks, and I would greatly appreciate any help.



