[
https://issues.apache.org/jira/browse/STORM-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263683#comment-15263683
]
L.Z. Xiang commented on STORM-1041:
-----------------------------------
I have run into a similar issue. I built a Trident topology with 4 workers. At
the beginning the topology runs well, but after a while performance drops off
quickly, so I added some tracking information to the topology and found
something: sometimes the gap between two consecutive batches is 10 minutes
(always 10 minutes, which is strange). Can anyone tell me what Storm is doing
during those 10 minutes? Is it stuck?
Could anyone offer some help?
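For reference, the kind of tracking I mean is a pass-through Trident filter
that logs long gaps between tuples; a simplified sketch (not my real code, and
the field name "event" is only a placeholder):
{code}
import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

// Pass-through filter: keeps every tuple, but logs whenever more than a
// minute has passed since the previous tuple seen by this task.
public class BatchGapLogger extends BaseFilter {
    private long lastTupleMs = 0;

    @Override
    public boolean isKeep(TridentTuple tuple) {
        long now = System.currentTimeMillis();
        if (lastTupleMs > 0 && now - lastTupleMs > 60 * 1000L) {
            System.err.println("gap of " + (now - lastTupleMs) / 1000 + "s since last tuple");
        }
        lastTupleMs = now;
        return true;
    }
}

// usage on an existing Trident stream:
// stream.each(new Fields("event"), new BatchGapLogger());
{code}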
> Topology with kafka spout stops processing
> ------------------------------------------
>
> Key: STORM-1041
> URL: https://issues.apache.org/jira/browse/STORM-1041
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-kafka
> Affects Versions: 0.9.5
> Reporter: Scott Bessler
> Priority: Critical
>
> Topology:
> KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) ->
> bolt that does processing (176 tasks/executors)
> 8 workers
> Using Netty
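> For illustration, a topology with this shape might be wired roughly as
> follows (a sketch only: the bolt classes, ZooKeeper/topic settings and the
> grouping field are placeholders, not the actual NoticeProcessorTopology code):
> {code}
> import backtype.storm.Config;
> import backtype.storm.StormSubmitter;
> import backtype.storm.spout.SchemeAsMultiScheme;
> import backtype.storm.topology.BasicOutputCollector;
> import backtype.storm.topology.OutputFieldsDeclarer;
> import backtype.storm.topology.TopologyBuilder;
> import backtype.storm.topology.base.BaseBasicBolt;
> import backtype.storm.tuple.Fields;
> import backtype.storm.tuple.Tuple;
> import backtype.storm.tuple.Values;
> import storm.kafka.KafkaSpout;
> import storm.kafka.SpoutConfig;
> import storm.kafka.StringScheme;
> import storm.kafka.ZkHosts;
>
> public class ExampleTopology {
>
>     // Placeholder for the "bolt that does grouping": tags each message with a key.
>     public static class GroupingBolt extends BaseBasicBolt {
>         public void execute(Tuple input, BasicOutputCollector collector) {
>             String msg = input.getString(0);
>             int key = Math.abs(msg.hashCode() % 176);  // placeholder key derivation
>             collector.emit(new Values(key, msg));
>         }
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("key", "message"));
>         }
>     }
>
>     // Placeholder for the "bolt that does processing" (176 executors below).
>     public static class ProcessingBolt extends BaseBasicBolt {
>         public void execute(Tuple input, BasicOutputCollector collector) {
>             // real processing would go here
>         }
>         public void declareOutputFields(OutputFieldsDeclarer declarer) { }
>     }
>
>     public static void main(String[] args) throws Exception {
>         SpoutConfig spoutConf = new SpoutConfig(
>                 new ZkHosts("zk1:2181"), "notices", "/kafka-spout", "notice-spout");
>         spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
>
>         TopologyBuilder builder = new TopologyBuilder();
>         builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);
>         builder.setBolt("grouping-bolt", new GroupingBolt(), 1)
>                .shuffleGrouping("kafka-spout");
>         builder.setBolt("processing-bolt", new ProcessingBolt(), 176)
>                .fieldsGrouping("grouping-bolt", new Fields("key"));
>
>         Config conf = new Config();
>         conf.setNumWorkers(8);  // 8 workers, as described above
>
>         StormSubmitter.submitTopology("NoticeProcessorTopology", conf, builder.createTopology());
>     }
> }
> {code}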
> Sometimes when a worker dies (we've seen it happen due to an OOM or load from
> a co-located worker) it will try to restart on the same node, then 20s later
> shut down and start on another node.
> {code}
> 2015-09-10 08:05:41,131 -0700 INFO backtype.storm.daemon.supervisor:0
> - Launching worker with assignment
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73]
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113]
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153]
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193
> 193] [225 225])} for this supervisor 8a845b9b-adaa-4943-b6a6-68fdadcc5146 on
> port 6701 with id 42a499b2-2c5c-43c2-be8a-a5b3f4f8a99e
> 2015-09-10 08:05:39,953 -0700 INFO backtype.storm.daemon.supervisor:0
> - Shutting down and clearing state for id
> 39c28ee2-abf9-4834-8b1f-0bd6933412e8. Current supervisor time: 1441897539.
> State: :disallowed, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897539,
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81]
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121]
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129]
> [161 161] [193 193] [225 225]}, :port 6700}
> 2015-09-10 08:05:22,693 -0700 INFO backtype.storm.daemon.supervisor:0
> - Launching worker with assignment
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73]
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113]
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153]
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193
> 193] [225 225])} for this supervisor f26e1fae-03bd-4fa8-9868-6a54993f3c5d on
> port 6700 with id 39c28ee2-abf9-4834-8b1f-0bd6933412e8
> 2015-09-10 08:05:21,588 -0700 INFO backtype.storm.daemon.supervisor:0
> - Shutting down and clearing state for id
> 4f0e4c22-6ccc-4d78-a20f-88bffb8def1d. Current supervisor time: 1441897521.
> State: :timed-out, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897490,
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81]
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121]
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129]
> [161 161] [193 193] [225 225]}, :port 6700}
> {code}
> While the worker was dead and then killed, other workers had Netty drop
> messages. In theory these messages should time out and be replayed. Our
> message timeout is 30s.
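> (That 30s corresponds to topology.message.timeout.secs; the fragment below is
> an illustrative sketch of how these settings are typically applied, not our
> actual config. The dropped-message errors we observed follow.)
> {code}
> // Sketch, illustrative values: at-least-once replay relies on these two settings.
> Config conf = new Config();
> conf.setMessageTimeoutSecs(30);  // topology.message.timeout.secs: pending tuples should fail after 30s
> conf.setMaxSpoutPending(1000);   // topology.max.spout.pending: placeholder value, caps in-flight tuples per spout task
> {code}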
> {code}
> 2015-09-10 08:05:50,914 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:44,904 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:43,902 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR b.storm.messaging.netty.Client:453
> - dropping 1 message(s) destined for
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> {code}
> However, these messages never time out, and MAX_SPOUT_PENDING has been
> reached, so no more tuples are emitted or processed.
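> The stall follows from how max spout pending gates emission. A toy model of
> that gating (an illustration only, not Storm's actual executor code):
> {code}
> // Toy model: why a spout stops emitting when pending tuples are never completed.
> public class SpoutGate {
>     private final int maxSpoutPending;
>     private int pending = 0;
>
>     public SpoutGate(int maxSpoutPending) {
>         this.maxSpoutPending = maxSpoutPending;
>     }
>
>     // Called when a tuple is acked, failed, or hits the message timeout.
>     // Dropped messages that never time out never reach this point.
>     public void complete() {
>         pending--;
>     }
>
>     // nextTuple() is only invoked while pending is below the cap; once the
>     // cap is reached and nothing ever completes, emission stops for good.
>     public boolean mayEmit() {
>         if (pending >= maxSpoutPending) {
>             return false;
>         }
>         pending++;
>         return true;
>     }
> }
> {code}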