silent5945 opened a new issue, #7966: URL: https://github.com/apache/storm/issues/7966
We are seeing a systematic issue in a production cluster where topology workers are being deliberately killed by the supervisor, which we absolutely want to avoid. From the nimbus/supervisor logs it is clear that the topologies' workers are being restarted because of a topology blob update:

```
2025-02-12 20:37:13.354 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4
2025-02-12 20:37:23.362 o.a.s.d.s.Slot [INFO] STATE running msInState: 93944773 topo:TOPO_A worker:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4 -> kill-blob-update msInState: 10001 topo:TOPO_A worker:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4
2025-02-12 20:37:33.682 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20321 topo:TOPO_A worker:null -> waiting-for-blob-update msInState: 1
2025-02-13 01:37:38.064 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff
2025-02-13 01:37:48.068 o.a.s.d.s.Slot [INFO] STATE running msInState: 322136 topo:TOPO_B worker:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff -> kill-blob-update msInState: 10001 topo:TOPO_B worker:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff
2025-02-13 01:37:58.081 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_B worker:null -> waiting-for-blob-update msInState: 0
2025-02-13 03:38:03.503 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:7b26a36d-c629-4d7a-bb62-e291766bad23
2025-02-13 03:38:13.506 o.a.s.d.s.Slot [INFO] STATE running msInState: 349122 topo:TOPO_C worker:7b26a36d-c629-4d7a-bb62-e291766bad23 -> kill-blob-update msInState: 10000 topo:TOPO_C worker:7b26a36d-c629-4d7a-bb62-e291766bad23
2025-02-13 03:38:23.518 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20012 topo:TOPO_C worker:null -> waiting-for-blob-update msInState: 0
2025-02-13 07:38:26.391 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:dec61a1e-3e6b-43eb-ba01-9778f1273fe8
2025-02-13 07:38:36.395 o.a.s.d.s.Slot [INFO] STATE running msInState: 373132 topo:TOPO_D worker:dec61a1e-3e6b-43eb-ba01-9778f1273fe8 -> kill-blob-update msInState: 10000 topo:TOPO_D worker:dec61a1e-3e6b-43eb-ba01-9778f1273fe8
2025-02-13 07:38:46.409 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_D worker:null -> waiting-for-blob-update msInState: 0
2025-02-13 12:38:49.243 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:6e835846-128e-45bc-82ad-a78895c20512
2025-02-13 12:38:59.246 o.a.s.d.s.Slot [INFO] STATE running msInState: 394136 topo:TOPO_E worker:6e835846-128e-45bc-82ad-a78895c20512 -> kill-blob-update msInState: 10000 topo:TOPO_E worker:6e835846-128e-45bc-82ad-a78895c20512
2025-02-13 12:39:09.260 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_E worker:null -> waiting-for-blob-update msInState: 1
```

Indeed, the nimbus log shows the topology blobs being updated right before the kills:

```
2025-02-12 20:30:03.329 o.a.s.d.n.Nimbus [INFO] Downloading 10 entries
2025-02-12 20:30:03.476 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/TOPO_A-stormjar.jar/nimbus-1:6627-2
2025-02-12 20:30:03.551 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/TOPO_A-stormconf.ser/nimbus-1:6627-2
2025-02-12 20:30:04.732 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/dep-b3e0e136-1c95-42e9-803e-676b4a8e972d.jar/nimbus-1:6627-3
2025-02-12 20:30:04.806 o.a.s.d.n.Nimbus [INFO] No more blobs to list for session d2118791-893d-4f91-89c9-8b529a20782c
2025-02-12 20:30:07.005 o.a.s.d.n.Nimbus [INFO] Downloading 10 entries
2025-02-12 20:30:07.124 o.a.s.d.n.Nimbus [INFO] No more blobs to list for session 4636c761-251a-428c-8ca0-adca6110d059
...
2025-02-12 20:37:11.304 o.a.s.d.n.Nimbus [INFO] Created download session 47588811-8920-450e-a2e7-aa70241bc650 for TOPO_A-stormconf.ser
2025-02-12 20:37:11.337 o.a.s.d.n.Nimbus [INFO] Created download session 2de876d7-d4b0-46c7-b2d1-52db2100c1a5 for TOPO_A-stormjar.jar
2025-02-12 20:37:11.381 o.a.s.d.n.Nimbus [INFO] Created download session 8baabff7-c8bc-49a7-93e8-35b576457d28 for dep-b3e0e136-1c95-42e9-803e-676b4a8e972d.jar
```
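The trailing number on those `/blobstore/<key>/nimbus-1:6627-N` nodes looks like the sequence number the supervisors react to, so to correlate the kills with actual version bumps we are planning to poll those ZooKeeper nodes directly. A minimal sketch of that watcher (assuming the default `storm.zookeeper.root` of `/storm` and a reachable ZooKeeper ensemble; Curator is used here only because Storm already ships it, any ZooKeeper client would do):

```java
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BlobVersionWatcher {
    public static void main(String[] args) throws Exception {
        // Assumption: ZooKeeper connect string passed as the first argument,
        // e.g. "zk1:2181,zk2:2181,zk3:2181".
        String connect = args.length > 0 ? args[0] : "localhost:2181";
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                connect, new ExponentialBackoffRetry(1000, 3));
        zk.start();
        zk.blockUntilConnected();
        try {
            while (true) {
                // Assumption: default storm.zookeeper.root of /storm.
                for (String key : zk.getChildren().forPath("/storm/blobstore")) {
                    // Children look like "nimbus-1:6627-2"; a change in the trailing
                    // number should line up with the kill-blob-update transitions
                    // seen in the supervisor log.
                    List<String> versions =
                            zk.getChildren().forPath("/storm/blobstore/" + key);
                    System.out.println(System.currentTimeMillis() + " " + key + " -> " + versions);
                }
                Thread.sleep(60_000);
            }
        } finally {
            zk.close();
        }
    }
}
```

Logging this once a minute over a day should tell us whether the sequence numbers really change right before each kill, or whether the supervisors are reacting to something else.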
However, we do not interact with the blobstore in any way: there is no blobstore map defined or configured, and we only submit the topology jar. Nothing in our system performs the update, neither via the API nor via the filesystem.

A couple of things to note:
- this is not caused by a Storm/ZooKeeper failover or a new leader election
- this is not caused by a topology rebalance
- this is not coming from the Storm configuration `supervisor.localizer.cache.target.size.mb = 10240`; the local storage seems clean and stable, between 300 MB and 1 GB
- there are no errors in ZooKeeper around those times
- judging from the timestamps it seems periodic, first happening after about 1d2h for a permanent topology (TOPO_A), but it also happens for the other topologies, which are short-lived (~10 minutes)

So I have a couple of questions to try to understand what is happening here and hopefully prevent the worker restarts:
1. Are there any mechanisms internal to Storm that will update a topology's blobs?
2. Can the blob update be caused by a `storm blobstore list` command?
3. Is it possible to prevent blob updates entirely at the topology level?
4. Can the blob update be caused by a ZooKeeper periodic cleanup somehow?

Thank you very much in advance for any help.
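Regarding question 2, to rule out our own tooling we are also planning to dump the blob metadata that nimbus reports and diff it between runs. A rough sketch of what I have in mind, assuming the client-side helpers `Utils.readStormConfig()` / `Utils.getClientBlobStore(...)` and the thrift accessor `ReadableBlobMeta.get_version()` are available in our Storm version (otherwise capturing `storm blobstore list` output would serve the same purpose):

```java
import java.util.Iterator;
import java.util.Map;

import org.apache.storm.blobstore.ClientBlobStore;
import org.apache.storm.generated.ReadableBlobMeta;
import org.apache.storm.utils.Utils;

public class BlobMetaDump {
    public static void main(String[] args) throws Exception {
        // Reads the usual storm.yaml / defaults from the classpath.
        Map<String, Object> conf = Utils.readStormConfig();
        ClientBlobStore blobStore = Utils.getClientBlobStore(conf);
        try {
            Iterator<String> keys = blobStore.listKeys();
            while (keys.hasNext()) {
                String key = keys.next();
                ReadableBlobMeta meta = blobStore.getBlobMeta(key);
                // If these versions change between runs while nobody uploads anything,
                // something server-side is rewriting the blobs.
                System.out.println(key + " version=" + meta.get_version());
            }
        } finally {
            blobStore.shutdown();
        }
    }
}
```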