[ https://issues.apache.org/jira/browse/STORM-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Bessler updated STORM-1041:
---------------------------------
    Description: 
Topology:
 KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) -> 
bolt that does processing (176 tasks/executors)
 8 workers
 Using Netty
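
For reference, a rough wiring sketch of this topology, assuming Storm 0.9.x 
(backtype.storm) and the storm-kafka spout; the bolt classes, topic name, 
ZooKeeper string, and field names below are placeholders, not our actual code:

{code}
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class NoticeProcessorTopologySketch {

    // Placeholder for the bolt that assigns a grouping key (1 task/executor).
    public static class GroupingBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) { /* emit (group-key, payload) */ }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("group-key", "payload")); }
    }

    // Placeholder for the bolt that does the actual processing (176 tasks/executors).
    public static class ProcessingBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) { /* process one notice */ }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("zk1:2181"), "notices", "/kafka-spout", "notice-spout");

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);     // 1 task/executor
        builder.setBolt("grouping-bolt", new GroupingBolt(), 1)            // 1 task/executor
               .shuffleGrouping("kafka-spout");
        builder.setBolt("processing-bolt", new ProcessingBolt(), 176)      // 176 tasks/executors
               .fieldsGrouping("grouping-bolt", new Fields("group-key"));

        Config conf = new Config();
        conf.setNumWorkers(8);  // 8 workers, using the Netty transport

        StormSubmitter.submitTopology("NoticeProcessorTopology", conf, builder.createTopology());
    }
}
{code}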

Sometimes when a worker dies (we've seen it happen due to an OOM or to load 
from a co-located worker), it tries to restart on the same node, then about 
20s later shuts down and starts on another node.

{code}
2015-09-10 08:05:41,131 -0700 INFO        backtype.storm.daemon.supervisor:0 - 
Launching worker with assignment 
#backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
"NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] 
[105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 
145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] 
[217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] [225 
225])} for this supervisor 8a845b9b-adaa-4943-b6a6-68fdadcc5146 on port 6701 
with id 42a499b2-2c5c-43c2-be8a-a5b3f4f8a99e
2015-09-10 08:05:39,953 -0700 INFO        backtype.storm.daemon.supervisor:0 - 
Shutting down and clearing state for id 39c28ee2-abf9-4834-8b1f-0bd6933412e8. 
Current supervisor time: 1441897539. State: :disallowed, Heartbeat: 
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897539, :storm-id 
"NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 41] [73 73] 
[105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 
145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] 
[217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] 
[225 225]}, :port 6700}
2015-09-10 08:05:22,693 -0700 INFO        backtype.storm.daemon.supervisor:0 - 
Launching worker with assignment 
#backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
"NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] 
[105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 
145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] 
[217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] [225 
225])} for this supervisor f26e1fae-03bd-4fa8-9868-6a54993f3c5d on port 6700 
with id 39c28ee2-abf9-4834-8b1f-0bd6933412e8
2015-09-10 08:05:21,588 -0700 INFO        backtype.storm.daemon.supervisor:0 - 
Shutting down and clearing state for id 4f0e4c22-6ccc-4d78-a20f-88bffb8def1d. 
Current supervisor time: 1441897521. State: :timed-out, Heartbeat: 
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897490, :storm-id 
"NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 41] [73 73] 
[105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 
145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] 
[217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] 
[225 225]}, :port 6700}
{code}

While the worker was dead and then being killed, other workers had their Netty 
clients drop messages destined for it. In theory these messages should time 
out and be replayed. Our message timeout is 30s.
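
For reference, the relevant knobs as they would look at submit time; this is a 
minimal sketch assuming a Java submitter, and the max-spout-pending value is 
illustrative rather than our actual setting:

{code}
import backtype.storm.Config;

Config conf = new Config();
// Tuples not acked within this window are supposed to be failed back to the
// spout and replayed (TOPOLOGY_MESSAGE_TIMEOUT_SECS).
conf.setMessageTimeoutSecs(30);
// Cap on un-acked tuples per spout task (TOPOLOGY_MAX_SPOUT_PENDING); once
// reached, nextTuple() is not called again until tuples are acked, failed,
// or timed out.
conf.setMaxSpoutPending(2000);  // illustrative value, not our real setting
{code}

The drops logged by the other workers during that window: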

{code}
2015-09-10 08:05:50,914 -0700 ERROR       b.storm.messaging.netty.Client:453 - 
dropping 1 message(s) destined for 
Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
2015-09-10 08:05:44,904 -0700 ERROR       b.storm.messaging.netty.Client:453 - 
dropping 1 message(s) destined for 
Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
2015-09-10 08:05:43,902 -0700 ERROR       b.storm.messaging.netty.Client:453 - 
dropping 1 message(s) destined for 
Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 - 
dropping 1 message(s) destined for 
Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 - 
dropping 1 message(s) destined for 
Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
{code}

However, these messages never time out, and since the MAX_SPOUT_PENDING limit 
has been reached, no more tuples are emitted or processed.
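
To spell out why that stalls the topology: the spout's pending count only 
shrinks on ack, fail, or timeout-driven failure, so if dropped tuples are 
never failed, the spout stays pinned at the max-spout-pending cap. A loose, 
simplified sketch of that throttle (not Storm's actual executor code):

{code}
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of the spout-side throttle; not Storm's real implementation.
class SpoutThrottleSketch {
    private final int maxSpoutPending;                        // TOPOLOGY_MAX_SPOUT_PENDING
    private final Map<Long, Long> pending = new HashMap<>();  // msgId -> emit time (ms)

    SpoutThrottleSketch(int maxSpoutPending) { this.maxSpoutPending = maxSpoutPending; }

    // nextTuple() is only invoked while the pending count is below the cap.
    boolean mayEmit() { return pending.size() < maxSpoutPending; }

    void onEmit(long msgId) { pending.put(msgId, System.currentTimeMillis()); }
    void onAck(long msgId)  { pending.remove(msgId); }
    void onFail(long msgId) { pending.remove(msgId); /* spout replays the tuple */ }

    // The message-timeout sweep should fail entries older than
    // TOPOLOGY_MESSAGE_TIMEOUT_SECS. What we observe is that tuples whose
    // messages were dropped by the Netty client never get expired, so
    // pending stays at the cap and mayEmit() never returns true again.
    void expireOlderThan(long timeoutMs) {
        long now = System.currentTimeMillis();
        pending.entrySet().removeIf(e -> now - e.getValue() > timeoutMs);
    }
}
{code}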



  was:
Topology:
 KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) -> 
bolt that does processing (176 tasks/executors)
 8 workers
 Using Netty

Sometimes when a worker dies (we've seen it happen due to an OOM or to load 
from a co-located worker), it tries to restart on the same node, then about 
20s later shuts down and starts on another node.

While the worker was dead and then being killed, other workers had their Netty 
clients drop messages destined for it. In theory these messages should time 
out and be replayed. Our message timeout is 30s.

However, these messages never time out, and since the MAX_SPOUT_PENDING limit 
has been reached, no more tuples are emitted or processed.




> Topology with kafka spout stops processing
> ------------------------------------------
>
>                 Key: STORM-1041
>                 URL: https://issues.apache.org/jira/browse/STORM-1041
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.5
>            Reporter: Scott Bessler
>            Priority: Critical
>
> Topology:
>  KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) -> 
> bolt that does processing (176 tasks/executors)
>  8 workers
>  Using Netty
> Sometimes when a worker dies (we've seen it happen due to an OOM or to load 
> from a co-located worker), it tries to restart on the same node, then about 
> 20s later shuts down and starts on another node.
> {code}
> 2015-09-10 08:05:41,131 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] 
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] 
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] 
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 
> 193] [225 225])} for this supervisor 8a845b9b-adaa-4943-b6a6-68fdadcc5146 on 
> port 6701 with id 42a499b2-2c5c-43c2-be8a-a5b3f4f8a99e
> 2015-09-10 08:05:39,953 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Shutting down and clearing state for id 
> 39c28ee2-abf9-4834-8b1f-0bd6933412e8. Current supervisor time: 1441897539. 
> State: :disallowed, Heartbeat: 
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897539, 
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] 
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] 
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] 
> [161 161] [193 193] [225 225]}, :port 6700}
> 2015-09-10 08:05:22,693 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] 
> [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] 
> [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] 
> [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 
> 193] [225 225])} for this supervisor f26e1fae-03bd-4fa8-9868-6a54993f3c5d on 
> port 6700 with id 39c28ee2-abf9-4834-8b1f-0bd6933412e8
> 2015-09-10 08:05:21,588 -0700 INFO        backtype.storm.daemon.supervisor:0 
> - Shutting down and clearing state for id 
> 4f0e4c22-6ccc-4d78-a20f-88bffb8def1d. Current supervisor time: 1441897521. 
> State: :timed-out, Heartbeat: 
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897490, 
> :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 
> 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] 
> [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] 
> [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] 
> [161 161] [193 193] [225 225]}, :port 6700}
> {code}
> While the worker was dead and then being killed, other workers had their 
> Netty clients drop messages destined for it. In theory these messages should 
> time out and be replayed. Our message timeout is 30s. 
> {code}
> 2015-09-10 08:05:50,914 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:44,904 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:43,902 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 
> - dropping 1 message(s) destined for 
> Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> {code}
> However, these messages never time out, and since the MAX_SPOUT_PENDING 
> limit has been reached, no more tuples are emitted or processed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
