[ https://issues.apache.org/jira/browse/STORM-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745013#comment-14745013 ]

Abhishek Agarwal commented on STORM-1041:
-----------------------------------------

Can you take a stack trace of the workers after the topology is stuck, and 
attach it here?
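
For reference, jstack <worker-pid> (or kill -3 <worker-pid>) against the stuck 
worker captures it. If it is easier to log the dump from inside the worker, a 
minimal helper along these lines would also do (the class name is hypothetical, 
not part of Storm):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Storm: prints every thread's stack in the
// current worker JVM, roughly the same information jstack produces externally.
public class ThreadDumpHelper {
    public static void dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // lockedMonitors/lockedSynchronizers = true to also report held locks
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.err.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.err.println("    at " + frame);
            }
            System.err.println();
        }
    }
}
{code}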

> Topology with kafka spout stops processing
> ------------------------------------------
>
>                 Key: STORM-1041
>                 URL: https://issues.apache.org/jira/browse/STORM-1041
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.5
>            Reporter: Scott Bessler
>            Priority: Critical
>
> Topology:
>  KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) -> 
> bolt that does processing (176 tasks/executors)
>  8 workers
>  Using Netty
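> In Java, that wiring looks roughly like the sketch below (GroupingBolt, 
> ProcessingBolt and the Kafka spout settings are hypothetical placeholders; 
> only the parallelism hints match the layout above):
> {code}
> import backtype.storm.Config;
> import backtype.storm.StormSubmitter;
> import backtype.storm.topology.TopologyBuilder;
> import backtype.storm.tuple.Fields;
> import storm.kafka.KafkaSpout;
> import storm.kafka.SpoutConfig;
> import storm.kafka.ZkHosts;
>
> // Hypothetical sketch of the wiring described above; GroupingBolt,
> // ProcessingBolt and the SpoutConfig values stand in for the real ones.
> public class NoticeProcessorTopology {
>     public static void main(String[] args) throws Exception {
>         SpoutConfig spoutConfig = new SpoutConfig(
>                 new ZkHosts("zkhost:2181"), "notices", "/kafka-spout", "notice-processor");
>
>         TopologyBuilder builder = new TopologyBuilder();
>         builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);   // 1 task/executor
>         builder.setBolt("grouping-bolt", new GroupingBolt(), 1)            // 1 task/executor
>                .shuffleGrouping("kafka-spout");
>         builder.setBolt("processing-bolt", new ProcessingBolt(), 176)      // 176 tasks/executors
>                .fieldsGrouping("grouping-bolt", new Fields("group-key"));
>
>         Config conf = new Config();
>         conf.setNumWorkers(8);  // 8 workers, Netty transport
>
>         StormSubmitter.submitTopology("NoticeProcessorTopology", conf, builder.createTopology());
>     }
> }
> {code}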
> Sometimes when a worker dies (we've seen it happen due to an OOM or load from 
> a co-located worker) it will try to restart on the same node, then 20s later 
> shut down and start on another node.
> {code}
> 2015-09-10 08:05:41,131 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] 
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] 
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] 
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 
> 193] [225 225])} for this supervisor 8a845b9b-adaa-4943-b6a6-68fdadcc5146 on 
> port 6701 with id 42a499b2-2c5c-43c2-be8a-a5b3f4f8a99e
> 2015-09-10 08:05:39,953 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Shutting down and clearing state for id 
> 39c28ee2-abf9-4834-8b1f-0bd6933412e8. Current supervisor time: 1441897539. 
> State: :disallowed, Heartbeat: 
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897539, 
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] 
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] 
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] 
> [161 161] [193 193] [225 225]}, :port 6700}
> 2015-09-10 08:05:22,693 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] 
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] 
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] 
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 
> 193] [225 225])} for this supervisor f26e1fae-03bd-4fa8-9868-6a54993f3c5d on 
> port 6700 with id 39c28ee2-abf9-4834-8b1f-0bd6933412e8
> 2015-09-10 08:05:21,588 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Shutting down and clearing state for id 
> 4f0e4c22-6ccc-4d78-a20f-88bffb8def1d. Current supervisor time: 1441897521. 
> State: :timed-out, Heartbeat: 
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897490, 
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] 
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] 
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] 
> [161 161] [193 193] [225 225]}, :port 6700}
> {code}
> While the worker was dead and then killed, other workers had Netty drop 
> messages destined for it. In theory these messages should time out and be 
> replayed. Our message timeout is 30s. 
> {code}
> 2015-09-10 08:05:50,914 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:44,904 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:43,902 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> {code}
> However, these messages never time out, and MAX_SPOUT_PENDING has been 
> reached, so no more tuples are emitted or processed.
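> For reference, the two limits at play here are topology.message.timeout.secs 
> and topology.max.spout.pending; a minimal sketch of how they are set (the 30s 
> value is the timeout mentioned above, the pending value is illustrative):
> {code}
> import backtype.storm.Config;
>
> public class TimeoutConfigSketch {
>     public static void main(String[] args) {
>         Config conf = new Config();
>         // Tuples not acked within this window are failed and replayed by the spout.
>         conf.setMessageTimeoutSecs(30);   // topology.message.timeout.secs
>         // Once this many tuples are in flight (emitted but not yet acked/failed),
>         // the spout stops emitting; if dropped tuples never time out, the
>         // topology stalls exactly as described.
>         conf.setMaxSpoutPending(1000);    // topology.max.spout.pending (illustrative)
>     }
> }
> {code}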



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
