silent5945 opened a new issue, #7966: URL: https://github.com/apache/storm/issues/7966
We are seeing a systematic issue in a production cluster where topology workers are being deliberately killed by the supervisor, which we absolutely want to avoid. From the nimbus/supervisor logs it is clear that the topologies' workers are being restarted because of a topology blob update:

```
2025-02-12 20:37:13.354 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4
2025-02-12 20:37:23.362 o.a.s.d.s.Slot [INFO] STATE running msInState: 93944773 topo:TOPO_A worker:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4 -> kill-blob-update msInState: 10001 topo:TOPO_A worker:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4
2025-02-12 20:37:33.682 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20321 topo:TOPO_A worker:null -> waiting-for-blob-update msInState: 1
2025-02-13 01:37:38.064 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff
2025-02-13 01:37:48.068 o.a.s.d.s.Slot [INFO] STATE running msInState: 322136 topo:TOPO_B worker:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff -> kill-blob-update msInState: 10001 topo:TOPO_B worker:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff
2025-02-13 01:37:58.081 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_B worker:null -> waiting-for-blob-update msInState: 0
2025-02-13 03:38:03.503 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:7b26a36d-c629-4d7a-bb62-e291766bad23
2025-02-13 03:38:13.506 o.a.s.d.s.Slot [INFO] STATE running msInState: 349122 topo:TOPO_C worker:7b26a36d-c629-4d7a-bb62-e291766bad23 -> kill-blob-update msInState: 10000 topo:TOPO_C worker:7b26a36d-c629-4d7a-bb62-e291766bad23
2025-02-13 03:38:23.518 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20012 topo:TOPO_C worker:null -> waiting-for-blob-update msInState: 0
2025-02-13 07:38:26.391 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:dec61a1e-3e6b-43eb-ba01-9778f1273fe8
2025-02-13 07:38:36.395 o.a.s.d.s.Slot [INFO] STATE running msInState: 373132 topo:TOPO_D worker:dec61a1e-3e6b-43eb-ba01-9778f1273fe8 -> kill-blob-update msInState: 10000 topo:TOPO_D worker:dec61a1e-3e6b-43eb-ba01-9778f1273fe8
2025-02-13 07:38:46.409 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_D worker:null -> waiting-for-blob-update msInState: 0
2025-02-13 12:38:49.243 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:6e835846-128e-45bc-82ad-a78895c20512
2025-02-13 12:38:59.246 o.a.s.d.s.Slot [INFO] STATE running msInState: 394136 topo:TOPO_E worker:6e835846-128e-45bc-82ad-a78895c20512 -> kill-blob-update msInState: 10000 topo:TOPO_E worker:6e835846-128e-45bc-82ad-a78895c20512
2025-02-13 12:39:09.260 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_E worker:null -> waiting-for-blob-update msInState: 1
```

Indeed, the nimbus log shows the topology blobs being updated right before the kills:

```
2025-02-12 20:30:03.329 o.a.s.d.n.Nimbus [INFO] Downloading 10 entries
2025-02-12 20:30:03.476 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/TOPO_A-stormjar.jar/nimbus-1:6627-2
2025-02-12 20:30:03.551 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/TOPO_A-stormconf.ser/nimbus-1:6627-2
2025-02-12 20:30:04.732 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/dep-b3e0e136-1c95-42e9-803e-676b4a8e972d.jar/nimbus-1:6627-3
2025-02-12 20:30:04.806 o.a.s.d.n.Nimbus [INFO] No more blobs to list for session d2118791-893d-4f91-89c9-8b529a20782c
2025-02-12 20:30:07.005 o.a.s.d.n.Nimbus [INFO] Downloading 10 entries
2025-02-12 20:30:07.124 o.a.s.d.n.Nimbus [INFO] No more blobs to list for session 4636c761-251a-428c-8ca0-adca6110d059
...
2025-02-12 20:37:11.304 o.a.s.d.n.Nimbus [INFO] Created download session 47588811-8920-450e-a2e7-aa70241bc650 for TOPO_A-stormconf.ser
2025-02-12 20:37:11.337 o.a.s.d.n.Nimbus [INFO] Created download session 2de876d7-d4b0-46c7-b2d1-52db2100c1a5 for TOPO_A-stormjar.jar
2025-02-12 20:37:11.381 o.a.s.d.n.Nimbus [INFO] Created download session 8baabff7-c8bc-49a7-93e8-35b576457d28 for dep-b3e0e136-1c95-42e9-803e-676b4a8e972d.jar
```
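The trailing number on those `/blobstore/<key>/nimbus-1:6627-N` nodes looks like the sequence number the supervisors react to, so to correlate the kills with actual version bumps we are planning to poll those ZooKeeper nodes directly. A minimal sketch of that watcher (assuming the default `storm.zookeeper.root` of `/storm` and a reachable ZooKeeper ensemble; Curator is used here only because Storm already ships it, any ZooKeeper client would do):

```java
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BlobVersionWatcher {
    public static void main(String[] args) throws Exception {
        // Assumption: ZooKeeper connect string passed as the first argument,
        // e.g. "zk1:2181,zk2:2181,zk3:2181".
        String connect = args.length > 0 ? args[0] : "localhost:2181";
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                connect, new ExponentialBackoffRetry(1000, 3));
        zk.start();
        zk.blockUntilConnected();
        try {
            while (true) {
                // Assumption: default storm.zookeeper.root of /storm.
                for (String key : zk.getChildren().forPath("/storm/blobstore")) {
                    // Children look like "nimbus-1:6627-2"; a change in the trailing
                    // number should line up with the kill-blob-update transitions
                    // seen in the supervisor log.
                    List<String> versions =
                            zk.getChildren().forPath("/storm/blobstore/" + key);
                    System.out.println(System.currentTimeMillis() + " " + key + " -> " + versions);
                }
                Thread.sleep(60_000);
            }
        } finally {
            zk.close();
        }
    }
}
```

Logging this once a minute over a day should tell us whether the sequence numbers really change right before each kill, or whether the supervisors are reacting to something else.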
However, we do not interact with the blobstore in any way: there is no blobstore map defined or configured, and we only submit the topology jar. Nothing in our system performs the update, neither via the API nor via the filesystem.

A couple of things to note:
- this is not caused by a Storm/ZooKeeper failover or a new leader election
- this is not caused by a topology rebalance
- this is not coming from the Storm configuration `supervisor.localizer.cache.target.size.mb = 10240`; the local storage seems clean and stable, between 300 MB and 1 GB
- there are no errors in ZooKeeper around those times
- judging from the timestamps it seems periodic, first happening after about 1d2h for a permanent topology (TOPO_A), but it also happens for the other topologies, which are short-lived (~10 minutes)

So I have a couple of questions to try to understand what is happening here and hopefully prevent the worker restarts:
1. Are there any mechanisms internal to Storm that will update a topology's blobs?
2. Can the blob update be caused by a `storm blobstore list` command?
3. Is it possible to prevent blob updates entirely at the topology level?
4. Can the blob update be caused by a ZooKeeper periodic cleanup somehow?

Thank you very much in advance for any help.
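Regarding question 2, to rule out our own tooling we are also planning to dump the blob metadata that nimbus reports and diff it between runs. A rough sketch of what I have in mind, assuming the client-side helpers `Utils.readStormConfig()` / `Utils.getClientBlobStore(...)` and the thrift accessor `ReadableBlobMeta.get_version()` are available in our Storm version (otherwise capturing `storm blobstore list` output would serve the same purpose):

```java
import java.util.Iterator;
import java.util.Map;

import org.apache.storm.blobstore.ClientBlobStore;
import org.apache.storm.generated.ReadableBlobMeta;
import org.apache.storm.utils.Utils;

public class BlobMetaDump {
    public static void main(String[] args) throws Exception {
        // Reads the usual storm.yaml / defaults from the classpath.
        Map<String, Object> conf = Utils.readStormConfig();
        ClientBlobStore blobStore = Utils.getClientBlobStore(conf);
        try {
            Iterator<String> keys = blobStore.listKeys();
            while (keys.hasNext()) {
                String key = keys.next();
                ReadableBlobMeta meta = blobStore.getBlobMeta(key);
                // If these versions change between runs while nobody uploads anything,
                // something server-side is rewriting the blobs.
                System.out.println(key + " version=" + meta.get_version());
            }
        } finally {
            blobStore.shutdown();
        }
    }
}
```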