[
https://issues.apache.org/jira/browse/STORM-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745013#comment-14745013
]
Abhishek Agarwal commented on STORM-1041:
-----------------------------------------
Can you take a stack trace after the topology is stuck and attach it here?
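(For reference: this usually means a thread dump of the stuck worker JVM, e.g. with the JDK's jstack run against the worker PID. The standalone snippet below is only an illustrative in-process equivalent using ThreadMXBean; it is not part of Storm.)
{code}
// Illustrative only: dump all thread stacks of the current JVM, roughly what
// running jstack against the worker PID produces externally.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Request locked monitors and synchronizers too, which helps spot deadlocks.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
{code}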
> Topology with kafka spout stops processing
> ------------------------------------------
>
> Key: STORM-1041
> URL: https://issues.apache.org/jira/browse/STORM-1041
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.5
> Reporter: Scott Bessler
> Priority: Critical
>
> Topology:
> KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) ->
> bolt that does processing (176 tasks/executors)
> 8 workers
> Using Netty
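> A minimal sketch of how such a topology might be wired is below (Storm 0.9.x API). The component names, Kafka/ZooKeeper details, groupings, and bolt bodies are illustrative assumptions; only the parallelism (1/1/176) and the 8 workers come from this report.
> {code}
> // Illustrative sketch only; not the actual topology code.
> import backtype.storm.Config;
> import backtype.storm.StormSubmitter;
> import backtype.storm.spout.SchemeAsMultiScheme;
> import backtype.storm.topology.BasicOutputCollector;
> import backtype.storm.topology.OutputFieldsDeclarer;
> import backtype.storm.topology.TopologyBuilder;
> import backtype.storm.topology.base.BaseBasicBolt;
> import backtype.storm.tuple.Fields;
> import backtype.storm.tuple.Tuple;
> import backtype.storm.tuple.Values;
> import storm.kafka.KafkaSpout;
> import storm.kafka.SpoutConfig;
> import storm.kafka.StringScheme;
> import storm.kafka.ZkHosts;
>
> public class NoticeTopologySketch {
>     // Placeholder for the bolt that does the grouping (1 task/executor).
>     public static class GroupingBolt extends BaseBasicBolt {
>         public void execute(Tuple input, BasicOutputCollector collector) {
>             collector.emit(new Values(input.getString(0)));
>         }
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("group-key"));
>         }
>     }
>
>     // Placeholder for the bolt that does the processing (176 tasks/executors).
>     public static class ProcessingBolt extends BaseBasicBolt {
>         public void execute(Tuple input, BasicOutputCollector collector) { /* work */ }
>         public void declareOutputFields(OutputFieldsDeclarer declarer) { }
>     }
>
>     public static void main(String[] args) throws Exception {
>         SpoutConfig spoutConfig =
>                 new SpoutConfig(new ZkHosts("zk:2181"), "notices", "/notices", "notice-spout");
>         spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
>
>         TopologyBuilder builder = new TopologyBuilder();
>         builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);   // 1 task/executor
>         builder.setBolt("grouping-bolt", new GroupingBolt(), 1)            // 1 task/executor
>                .shuffleGrouping("kafka-spout");
>         builder.setBolt("processing-bolt", new ProcessingBolt(), 176)      // 176 tasks/executors
>                .fieldsGrouping("grouping-bolt", new Fields("group-key"));
>
>         Config conf = new Config();
>         conf.setNumWorkers(8);  // 8 workers; Netty is the default transport in 0.9.x
>         StormSubmitter.submitTopology("NoticeProcessorTopology", conf, builder.createTopology());
>     }
> }
> {code}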
> Sometimes when a worker dies (we've seen it happen due to an OOM or load from
> a co-located worker), it will try to restart on the same node, then 20s later
> shut down and start on another node.
> {code}
> 2015-09-10 08:05:41,131 -0700 INFO backtype.storm.daemon.supervisor:0
> - Launching worker with assignment
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73]
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113]
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153]
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193
> 193] [225 225])} for this supervisor 8a845b9b-adaa-4943-b6a6-68fdadcc5146 on
> port 6701 with id 42a499b2-2c5c-43c2-be8a-a5b3f4f8a99e
> 2015-09-10 08:05:39,953 -0700 INFO backtype.storm.daemon.supervisor:0
> - Shutting down and clearing state for id
> 39c28ee2-abf9-4834-8b1f-0bd6933412e8. Current supervisor time: 1441897539.
> State: :disallowed, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897539,
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81]
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121]
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129]
> [161 161] [193 193] [225 225]}, :port 6700}
> 2015-09-10 08:05:22,693 -0700 INFO backtype.storm.daemon.supervisor:0
> - Launching worker with assignment
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73]
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113]
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153]
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193
> 193] [225 225])} for this supervisor f26e1fae-03bd-4fa8-9868-6a54993f3c5d on
> port 6700 with id 39c28ee2-abf9-4834-8b1f-0bd6933412e8
> 2015-09-10 08:05:21,588 -0700 INFO backtype.storm.daemon.supervisor:0
> - Shutting down and clearing state for id
> 4f0e4c22-6ccc-4d78-a20f-88bffb8def1d. Current supervisor time: 1441897521.
> State: :timed-out, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897490,
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81]
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121]
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129]
> [161 161] [193 193] [225 225]}, :port 6700}
> {code}
> While the worker was dead and then killed, other workers had Netty drop
> messages destined for it. In theory these messages should time out and be
> replayed; our message timeout is 30s.
> {code}
> 2015-09-10 08:05:50,914 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:44,904 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:43,902 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> {code}
> However, these messages never time out, and the max spout pending limit
> (topology.max.spout.pending) has been reached, so no more tuples are emitted
> or processed.
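> For reference, these are the two settings that interact in the stuck state above; the sketch below shows one common way to set them (the 30s value is from this report, the max-spout-pending value is only a placeholder):
> {code}
> // Sketch only: message timeout vs. max spout pending.
> import backtype.storm.Config;
>
> Config conf = new Config();
> conf.setMessageTimeoutSecs(30);   // topology.message.timeout.secs: tuples should fail and be replayed after 30s
> conf.setMaxSpoutPending(1000);    // topology.max.spout.pending: placeholder value, caps in-flight tuples per spout task
> {code}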