Zhengdai Hu created STORM-3104:
----------------------------------
Summary: Delayed launch due to accidental transitioning in state
machine
Key: STORM-3104
URL: https://issues.apache.org/jira/browse/STORM-3104
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.0.0
Reporter: Zhengdai Hu
Fix For: 2.0.0
There is a comparison in
{code:java}
handleWaitingForBlobUpdate()
{code}
between dynamic state's current assignment and new assignment, which
accidentally route back state machine just transitioned from
WAIT_FOR_BLOB_LOCALIZATION back to WAIT_FOR_BLOB_LOCALIZATION. This is because
the current assignment in this case is highly likely to be null (I'm not sure
if it's guaranteed) and causes delay for a worker to start/restart.
The symptom is able to be reproduced by launching an empty supervisor and
submit any topology. Here's the log sample:
{code:shell}
2018-06-13 16:57:10.254 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState:
6024 -> EMPTY msInState: 6024
2018-06-13 16:57:10.255 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE EMPTY
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from EMPTY
to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.257 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE EMPTY msInState:
6027 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:10.258 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs
are []
2018-06-13 16:57:11.259 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE
WAITING_FOR_BLOB_LOCALIZATION msInState: 1003 -> WAITING_FOR_BLOB_LOCALIZATION
msInState: 1003
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:11.260 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs
[BLOB CHANGING LOCAL TOPO BLOB TOPO_CONF test-1-1528927024
LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), BLOB CHANGING
LOCAL TOPO BLOB TOPO_CODE test-1-1528927024
LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to
pending...
2018-06-13 16:57:12.262 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE
WAITING_FOR_BLOB_LOCALIZATION msInState: 2005 -> WAITING_FOR_BLOB_LOCALIZATION
msInState: 2005
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.263 o.a.s.d.s.Slot SLOT_6700 [DEBUG] found changing blobs
[BLOB CHANGING LOCAL TOPO BLOB TOPO_JAR test-1-1528927024
LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)] moving them to
pending...
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization
LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ?
false
2018-06-13 16:57:12.274 o.a.s.d.s.Slot SLOT_6700 [INFO] There are pending
changes, waiting for them to finish before launching container...
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from
WAITING_FOR_BLOB_LOCALIZATION to WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE
WAITING_FOR_BLOB_LOCALIZATION msInState: 2018 -> WAITING_FOR_BLOB_UPDATE
msInState: 1
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE
WAITING_FOR_BLOB_UPDATE
2018-06-13 16:57:12.275 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization:
null, new: LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02), current: null,
pdchanging: LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)
2018-06-13 16:57:12.276 o.a.s.d.s.Slot SLOT_6700 [INFO] SLOT 6700: Assignment
Changed from null to LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02)
2018-06-13 16:57:12.278 o.a.s.d.s.Slot SLOT_6700 [DEBUG] Transition from
WAITING_FOR_BLOB_UPDATE to WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.278 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE
WAITING_FOR_BLOB_UPDATE msInState: 4 -> WAITING_FOR_BLOB_LOCALIZATION
msInState: 0
2018-06-13 16:57:12.279 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE
WAITING_FOR_BLOB_LOCALIZATION
2018-06-13 16:57:12.279 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingChangingBlobs
are []
2018-06-13 16:57:12.279 o.a.s.d.s.Slot SLOT_6700 [DEBUG] pendingLocalization
LocalAssignment(topology_id:test-1-1528927024,
executors:[ExecutorInfo(task_start:10, task_end:10),
ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:4,
task_end:4), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:1,
task_end:1), ExecutorInfo(task_start:13, task_end:13)],
resources:WorkerResources(mem_on_heap:768.0, mem_off_heap:0.0, cpu:60.0,
shared_mem_on_heap:0.0, shared_mem_off_heap:0.0,
resources:{offheap.memory.mb=0.0, onheap.memory.mb=768.0,
cpu.pcore.percent=60.0}, shared_resources:{}), owner:zhu02) == current null ?
false
2018-06-13 16:57:12.280 o.a.s.d.s.Slot SLOT_6700 [DEBUG] launch container for
the first time
2018-06-13 16:57:12.284 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Created
Worker ID 4155b2bb-ebd1-431d-907f-d8a4ff1e1da4
{code}
I would like to know if this is actually the desired behavior of the state
machine, or I can help fix the bug. The implementation would be to redesign the
if statement.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)