[ 
https://issues.apache.org/jira/browse/STORM-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated STORM-794:
-------------------------------
    Description: 
I met an issue from Trident Topology in production env.

Normally, when we kill a topology via UI, Nimbus changes Topology status to 
"killed", and when Spout determines new status, it becomes deactivated so bolts 
can handle remain tuples within wait-time.
AFAIK that's how Storm guarantees graceful shutdown.

But, Trident Topology seems not handle "deactivate" while we try shutdown 
topology gracefully.
MasterBatchCoordinator never stops making next transaction, so Trident Spout 
never stops emitting, bolts (function) always take care of tuples.

Topology setting
- 1 worker, 1 acker
- max spout pending: 1
- TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS : 5
-- It may be weird but MasterBatchCoordinator's default value is 1

* Nimbus log

{code}
2015-04-20 09:59:07.954 INFO  [pool-5-thread-41][nimbus] Delaying event :remove 
for 120 secs for BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
...
2015-04-20 09:59:07.955 INFO  [pool-5-thread-41][nimbus] Updated 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 with status {:type :killed, 
:kill-time-secs 120}
...
2015-04-20 10:01:07.956 INFO  [timer][nimbus] Killing topology: 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
...
2015-04-20 10:01:14.448 INFO  [timer][nimbus] Cleaning up 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
{code}

* Supervisor log

{code}
2015-04-20 10:01:07.960 INFO  [Thread-1][supervisor] Removing code for storm id 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down and clearing 
state for id 9719259e-528c-4336-abf9-592c1bb9a00b. Current supervisor time: 
1429491667. State: :disallowed, Heartbeat: 
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429491667, :storm-id 
"BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015", :executors #{[2 2] [3 3] 
[4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [-1 
-1] [1 1]}, :port 6706}
2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down 
5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
2015-04-20 10:01:08.974 INFO  [Thread-2][supervisor] Shut down 
5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
{code}

* Worker log

{code}
2015-04-20 10:01:07.985 INFO  [Thread-33][worker] Shutting down worker 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 
5bc084a2-b668-4610-86f6-9b93304d40a8 6706
2015-04-20 10:01:07.985 INFO  [Thread-33][worker] Shutting down receive thread
2015-04-20 10:01:07.988 WARN  [Thread-33][ExponentialBackoffRetry] maxRetries 
too large (300). Pinning to 29
2015-04-20 10:01:07.988 INFO  [Thread-33][StormBoundedExponentialBackoffRetry] 
The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
2015-04-20 10:01:07.988 INFO  [Thread-33][Client] New Netty Client, connect to 
localhost, 6706, config: , buffer_size: 5242880
2015-04-20 10:01:07.991 INFO  [client-schedule-service-1][Client] Reconnect 
started for Netty-Client-localhost/127.0.0.1:6706... [0]
2015-04-20 10:01:07.996 INFO  [Thread-33][loader] Shutting down 
receiving-thread: [BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015, 6706]
...
2015-04-20 10:01:08.044 INFO  [Thread-33][Client] Closing Netty Client 
Netty-Client-localhost/127.0.0.1:6706
2015-04-20 10:01:08.044 INFO  [Thread-33][Client] Waiting for pending batchs to 
be sent with Netty-Client-localhost/127.0.0.1:6706..., timeout: 600000ms, 
pendings: 1
{code}

I found activating log, but cannot find deactivating log.

{code}
2015-04-20 09:50:24.556 INFO  [Thread-30-$mastercoord-bg0][executor] Activating 
spout $mastercoord-bg0:(1)
{code}

Please note that it doesn't work when I just push button to "deactivate" 
topology via UI.

We're changing our Topology to normal Spout-Bolt, but personally I'd like to 
see it resolved. 

  was:
I met an issue from Trident Topology in production env.

Normally, when we kill a topology via UI, Nimbus changes Topology status to 
"killed", and when Spout determines new status, it becomes deactivated so bolts 
can handle remain tuples within wait-time.
AFAIK that's how Storm guarantees graceful shutdown.

But, Trident Topology seems not handle "deactivate" while we try shutdown 
topology gracefully.
MasterBatchCoordinator never stops making next transaction, so Trident Spout 
never stops emitting, bolts (function) always take care of tuples.

Topology setting
- 1 worker, 1 acker
- max spout pending: 1
- TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS : 5
-- It may be weird but MasterBatchCoordinator's default value is 1

* Nimbus log

{code}
2015-04-20 09:59:07.954 INFO  [pool-5-thread-41][nimbus] Delaying event :remove 
for 120 secs for BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
...
2015-04-20 09:59:07.955 INFO  [pool-5-thread-41][nimbus] Updated 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 with status {:type :killed, 
:kill-time-secs 120}
...
2015-04-20 10:01:07.956 INFO  [timer][nimbus] Killing topology: 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
...
2015-04-20 10:01:14.448 INFO  [timer][nimbus] Cleaning up 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
{code}

* Supervisor log

{code}
2015-04-20 10:01:07.960 INFO  [Thread-1][supervisor] Removing code for storm id 
BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down and clearing 
state for id 9719259e-528c-4336-abf9-592c1bb9a00b. Current supervisor time: 
1429491667. State: :disallowed, Heartbeat: 
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429491667, :storm-id 
"BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015", :executors #{[2 2] [3 3] 
[4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [-1 
-1] [1 1]}, :port 6706}
2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down 
5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
2015-04-20 10:01:08.974 INFO  [Thread-2][supervisor] Shut down 
5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
{code}

* Worker log

{code}
worker-6706.log.2015-04-20:2015-04-20 10:01:11.313 INFO  [Thread-33][worker] 
Shutting down worker BFDC-topology-DynamicCollect-95cdc86f-74-1429491015 
e121261a-fcf3-4e0b-870a-fbfe02280f8e 6706
...
worker-6706.log.2015-04-20-2015-04-20 10:01:11.317 INFO  [Thread-33][worker] 
Shutting down receive thread
...
worker-6706.log.2015-04-20-2015-04-20 10:01:11.324 WARN  
[Thread-33][ExponentialBackoffRetry] maxRetries too large (300). Pinning to 29
worker-6706.log.2015-04-20-2015-04-20 10:01:11.324 INFO  
[Thread-33][StormBoundedExponentialBackoffRetry] The baseSleepTimeMs [100] the 
maxSleepTimeMs [1000] the maxRetries [300]
worker-6706.log.2015-04-20-2015-04-20 10:01:11.324 INFO  [Thread-33][Client] 
New Netty Client, connect to localhost, 6706, config: , buffer_size: 5242880
...
worker-6706.log.2015-04-20-2015-04-20 10:01:11.327 INFO  
[client-schedule-service-1][Client] Reconnect started for 
Netty-Client-localhost/127.0.0.1:6706... [0]
worker-6706.log.2015-04-20-2015-04-20 10:01:11.332 INFO  [Thread-33][loader] 
Shutting down receiving-thread: 
[BFDC-topology-DynamicCollect-95cdc86f-74-1429491015, 6706]
{code}

I found activating log, but cannot find deactivating log.

{code}
2015-04-20 09:50:24.556 INFO  [Thread-30-$mastercoord-bg0][executor] Activating 
spout $mastercoord-bg0:(1)
{code}

Please note that it doesn't work when I just push button to "deactivate" 
topology via UI.

We're changing our Topology to normal Spout-Bolt, but personally I'd like to 
see it resolved. 


> Trident Topology doesn't handle deactivate during graceful shutdown
> -------------------------------------------------------------------
>
>                 Key: STORM-794
>                 URL: https://issues.apache.org/jira/browse/STORM-794
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Jungtaek Lim
>
> I met an issue from Trident Topology in production env.
> Normally, when we kill a topology via UI, Nimbus changes Topology status to 
> "killed", and when Spout determines new status, it becomes deactivated so 
> bolts can handle remain tuples within wait-time.
> AFAIK that's how Storm guarantees graceful shutdown.
> But, Trident Topology seems not handle "deactivate" while we try shutdown 
> topology gracefully.
> MasterBatchCoordinator never stops making next transaction, so Trident Spout 
> never stops emitting, bolts (function) always take care of tuples.
> Topology setting
> - 1 worker, 1 acker
> - max spout pending: 1
> - TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS : 5
> -- It may be weird but MasterBatchCoordinator's default value is 1
> * Nimbus log
> {code}
> 2015-04-20 09:59:07.954 INFO  [pool-5-thread-41][nimbus] Delaying event 
> :remove for 120 secs for BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> ...
> 2015-04-20 09:59:07.955 INFO  [pool-5-thread-41][nimbus] Updated 
> BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 with status {:type 
> :killed, :kill-time-secs 120}
> ...
> 2015-04-20 10:01:07.956 INFO  [timer][nimbus] Killing topology: 
> BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> ...
> 2015-04-20 10:01:14.448 INFO  [timer][nimbus] Cleaning up 
> BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> {code}
> * Supervisor log
> {code}
> 2015-04-20 10:01:07.960 INFO  [Thread-1][supervisor] Removing code for storm 
> id BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> 2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down and 
> clearing state for id 9719259e-528c-4336-abf9-592c1bb9a00b. Current 
> supervisor time: 1429491667. State: :disallowed, Heartbeat: 
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429491667, 
> :storm-id "BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015", :executors 
> #{[2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 
> 13] [14 14] [-1 -1] [1 1]}, :port 6706}
> 2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down 
> 5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
> 2015-04-20 10:01:08.974 INFO  [Thread-2][supervisor] Shut down 
> 5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
> {code}
> * Worker log
> {code}
> 2015-04-20 10:01:07.985 INFO  [Thread-33][worker] Shutting down worker 
> BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 
> 5bc084a2-b668-4610-86f6-9b93304d40a8 6706
> 2015-04-20 10:01:07.985 INFO  [Thread-33][worker] Shutting down receive thread
> 2015-04-20 10:01:07.988 WARN  [Thread-33][ExponentialBackoffRetry] maxRetries 
> too large (300). Pinning to 29
> 2015-04-20 10:01:07.988 INFO  
> [Thread-33][StormBoundedExponentialBackoffRetry] The baseSleepTimeMs [100] 
> the maxSleepTimeMs [1000] the maxRetries [300]
> 2015-04-20 10:01:07.988 INFO  [Thread-33][Client] New Netty Client, connect 
> to localhost, 6706, config: , buffer_size: 5242880
> 2015-04-20 10:01:07.991 INFO  [client-schedule-service-1][Client] Reconnect 
> started for Netty-Client-localhost/127.0.0.1:6706... [0]
> 2015-04-20 10:01:07.996 INFO  [Thread-33][loader] Shutting down 
> receiving-thread: [BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015, 6706]
> ...
> 2015-04-20 10:01:08.044 INFO  [Thread-33][Client] Closing Netty Client 
> Netty-Client-localhost/127.0.0.1:6706
> 2015-04-20 10:01:08.044 INFO  [Thread-33][Client] Waiting for pending batchs 
> to be sent with Netty-Client-localhost/127.0.0.1:6706..., timeout: 600000ms, 
> pendings: 1
> {code}
> I found activating log, but cannot find deactivating log.
> {code}
> 2015-04-20 09:50:24.556 INFO  [Thread-30-$mastercoord-bg0][executor] 
> Activating spout $mastercoord-bg0:(1)
> {code}
> Please note that it doesn't work when I just push button to "deactivate" 
> topology via UI.
> We're changing our Topology to normal Spout-Bolt, but personally I'd like to 
> see it resolved. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to