Which Storm version are you using?
Is this happening outside human-triggered deployments?
There was a recent fix related to unwanted topology redeployments:

https://github.com/apache/storm/pull/3697
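
In the meantime, a quick way to double-check which Storm version the workers are actually running: a minimal sketch, assuming the storm-client jar is on the worker classpath (`org.apache.storm.utils.VersionInfo` is, as far as I know, the same class the `storm version` command reports from):

```java
import org.apache.storm.utils.VersionInfo;

// Minimal sketch: print the Storm version resolved from the current classpath.
// Run it with the same classpath the workers use to rule out a version mismatch.
public class PrintStormVersion {
    public static void main(String[] args) {
        System.out.println("Storm version: " + VersionInfo.getVersion());
    }
}
```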


On Mon, 17 Feb 2025 at 09:22, silent5945 (via GitHub) <g...@apache.org> wrote:
>
>
> silent5945 opened a new issue, #7966:
> URL: https://github.com/apache/storm/issues/7966
>
>    We are seeing a systematic issue in a production cluster: topology workers are being intentionally killed by the supervisor, which is something we absolutely want to avoid. From the nimbus/supervisor logs it is clear that the topologies' workers are being restarted because of a topology blob update:
>
>    ```
>    2025-02-12 20:37:13.354 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4
>    2025-02-12 20:37:23.362 o.a.s.d.s.Slot [INFO] STATE running msInState: 93944773 topo:TOPO_A worker:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4 -> kill-blob-update msInState: 10001 topo:TOPO_A worker:bfd5e3f0-2900-4d59-896e-77fa8e15b7a4
>    2025-02-12 20:37:33.682 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20321 topo:TOPO_A worker:null -> waiting-for-blob-update msInState: 1
>    2025-02-13 01:37:38.064 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff
>    2025-02-13 01:37:48.068 o.a.s.d.s.Slot [INFO] STATE running msInState: 322136 topo:TOPO_B worker:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff -> kill-blob-update msInState: 10001 topo:TOPO_B worker:e11d8695-82ef-4d75-a32e-fc1f5d4c3fff
>    2025-02-13 01:37:58.081 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_B worker:null -> waiting-for-blob-update msInState: 0
>    2025-02-13 03:38:03.503 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:7b26a36d-c629-4d7a-bb62-e291766bad23
>    2025-02-13 03:38:13.506 o.a.s.d.s.Slot [INFO] STATE running msInState: 349122 topo:TOPO_C worker:7b26a36d-c629-4d7a-bb62-e291766bad23 -> kill-blob-update msInState: 10000 topo:TOPO_C worker:7b26a36d-c629-4d7a-bb62-e291766bad23
>    2025-02-13 03:38:23.518 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20012 topo:TOPO_C worker:null -> waiting-for-blob-update msInState: 0
>    2025-02-13 07:38:26.391 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:dec61a1e-3e6b-43eb-ba01-9778f1273fe8
>    2025-02-13 07:38:36.395 o.a.s.d.s.Slot [INFO] STATE running msInState: 373132 topo:TOPO_D worker:dec61a1e-3e6b-43eb-ba01-9778f1273fe8 -> kill-blob-update msInState: 10000 topo:TOPO_D worker:dec61a1e-3e6b-43eb-ba01-9778f1273fe8
>    2025-02-13 07:38:46.409 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_D worker:null -> waiting-for-blob-update msInState: 0
>    2025-02-13 12:38:49.243 o.a.s.d.s.Container [INFO] Killing b573591b-fc08-4e6a-93e3-9857c6a64676-10.41.123.13:6e835846-128e-45bc-82ad-a78895c20512
>    2025-02-13 12:38:59.246 o.a.s.d.s.Slot [INFO] STATE running msInState: 394136 topo:TOPO_E worker:6e835846-128e-45bc-82ad-a78895c20512 -> kill-blob-update msInState: 10000 topo:TOPO_E worker:6e835846-128e-45bc-82ad-a78895c20512
>    2025-02-13 12:39:09.260 o.a.s.d.s.Slot [INFO] STATE kill-blob-update msInState: 20014 topo:TOPO_E worker:null -> waiting-for-blob-update msInState: 1
>    ```
>
>    Indeed, we can see in the nimbus log that the topology blobs are being updated right before the kills:
>
>    ```
>    2025-02-12 20:30:03.329 o.a.s.d.n.Nimbus [INFO] Downloading 10 entries
>    2025-02-12 20:30:03.476 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/TOPO_A-stormjar.jar/nimbus-1:6627-2
>    2025-02-12 20:30:03.551 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/TOPO_A-stormconf.ser/nimbus-1:6627-2
>    2025-02-12 20:30:04.732 o.a.s.c.StormClusterStateImpl [INFO] set-path: /blobstore/dep-b3e0e136-1c95-42e9-803e-676b4a8e972d.jar/nimbus-1:6627-3
>    2025-02-12 20:30:04.806 o.a.s.d.n.Nimbus [INFO] No more blobs to list for session d2118791-893d-4f91-89c9-8b529a20782c
>    2025-02-12 20:30:07.005 o.a.s.d.n.Nimbus [INFO] Downloading 10 entries
>    2025-02-12 20:30:07.124 o.a.s.d.n.Nimbus [INFO] No more blobs to list for session 4636c761-251a-428c-8ca0-adca6110d059
>    ...
>    2025-02-12 20:37:11.304 o.a.s.d.n.Nimbus [INFO] Created download session 47588811-8920-450e-a2e7-aa70241bc650 for TOPO_A-stormconf.ser
>    2025-02-12 20:37:11.337 o.a.s.d.n.Nimbus [INFO] Created download session 2de876d7-d4b0-46c7-b2d1-52db2100c1a5 for TOPO_A-stormjar.jar
>    2025-02-12 20:37:11.381 o.a.s.d.n.Nimbus [INFO] Created download session 8baabff7-c8bc-49a7-93e8-35b576457d28 for dep-b3e0e136-1c95-42e9-803e-676b4a8e972d.jar
>    ```
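>
>    To check whether the blob content is genuinely changing (rather than just being re-listed), here is a minimal sketch that polls Nimbus for the blob versions over the thrift client; it assumes a Storm 2.x client on the classpath, and the blob keys below are the placeholder names from the logs above:
>
>    ```java
>    import java.util.Map;
>    import org.apache.storm.generated.ReadableBlobMeta;
>    import org.apache.storm.utils.NimbusClient;
>    import org.apache.storm.utils.Utils;
>
>    // Minimal sketch: read the version Nimbus reports for each topology blob.
>    // If the reported version changes between runs, something really is updating
>    // the blob content, not just listing it.
>    public class CheckBlobVersions {
>        public static void main(String[] args) throws Exception {
>            Map<String, Object> conf = Utils.readStormConfig();
>            try (NimbusClient nimbus = NimbusClient.getConfiguredClient(conf)) {
>                for (String key : new String[] {"TOPO_A-stormjar.jar", "TOPO_A-stormconf.ser"}) {
>                    ReadableBlobMeta meta = nimbus.getClient().getBlobMeta(key);
>                    System.out.println(key + " -> version " + meta.get_version());
>                }
>            }
>        }
>    }
>    ```
>
>    This only reads blob metadata through the Nimbus thrift API, so it should be safe to run repeatedly against the production cluster.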
>    But we are not interacting with the blobstore in any way: there is no blobstore map defined or configured, we only submit the topology jar (a plain submission of the kind sketched after the list below), and nothing in our system performs the update, either through the API or the filesystem. A couple of things to note:
>
>    - this is not caused by a Storm / Zookeeper failover or a new leader election
>    - this is not caused by a topology rebalance
>    - this is not coming from the Storm configuration `supervisor.localizer.cache.target.size.mb = 10240`; the local storage seems to be clean and stable, between 300 MB and 1 GB
>    - there is no error in Zookeeper around those times
>    - it seems to be periodic (judging from the timestamps), first happening after 1d2h for a permanent topology (TOPO_A), but also happening for the other topologies, which are short-lived (10 minutes)
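>
>    For reference, this is roughly what the plain submission mentioned above looks like: a minimal sketch with placeholder names, using the standard `StormSubmitter` API, with nothing blobstore-related configured:
>
>    ```java
>    import java.util.Map;
>    import org.apache.storm.Config;
>    import org.apache.storm.StormSubmitter;
>    import org.apache.storm.spout.SpoutOutputCollector;
>    import org.apache.storm.task.TopologyContext;
>    import org.apache.storm.topology.OutputFieldsDeclarer;
>    import org.apache.storm.topology.TopologyBuilder;
>    import org.apache.storm.topology.base.BaseRichSpout;
>
>    // Minimal sketch of a "plain" submission: only the topology jar is submitted,
>    // and nothing blobstore-related (e.g. topology.blobstore.map) is configured.
>    public class SubmitTopoA {
>
>        // Placeholder spout (emits nothing), just so the topology passes validation.
>        public static class IdleSpout extends BaseRichSpout {
>            @Override
>            public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) { }
>
>            @Override
>            public void nextTuple() { }
>
>            @Override
>            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
>        }
>
>        public static void main(String[] args) throws Exception {
>            TopologyBuilder builder = new TopologyBuilder();
>            builder.setSpout("idle-spout", new IdleSpout(), 1);
>
>            Config conf = new Config();
>            conf.setNumWorkers(1);
>
>            // Submitted via `storm jar`; the jar itself becomes the <topo-id>-stormjar.jar
>            // blob and the serialized conf becomes <topo-id>-stormconf.ser, as in the logs above.
>            StormSubmitter.submitTopology("TOPO_A", conf, builder.createTopology());
>        }
>    }
>    ```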
>
>    So I have a couple of questions to try and understand what's happening here and hopefully prevent the worker restarts:
>
>    1. Are there any mechanisms internal to Storm that will update a topology's blobs?
>    2. Can the blob update be caused by a `storm blobstore list` command?
>    3. Is it possible to prevent blob updates entirely at the topology level?
>    4. Can the blob update be caused by a Zookeeper periodic cleanup somehow?
>
>    Thank you very much in advance for any help.
>
>