GitHub user danny0405 opened a pull request:
https://github.com/apache/storm/pull/2618
[STORM-2905] Fix KeyNotFoundException when kill a storm and add isRem…
When we kill a topology, the Supervisor executor may still request the
topology's blob files at the moment they are being removed, and it gets a
KeyNotFoundException.
I stepped through the code and found the reason:
1. The `topologyBlobs` map of AsyncLocalizer, which is a singleton on each
supervisor node, is not guarded by a lock.
2. The jar/code/conf blob keys of a killed topology are removed from
`topologyBlobs` only by a timer task, the cleanUp() method of AsyncLocalizer.
The removal condition is: [no one references the blobs] AND [the blobs were
removed by the master OR the max configured size is exceeded]. The default
scheduling interval is 30 seconds.
3. When a topology is killed on a node [which means that the slot's container
becomes empty], AsyncLocalizer runs releaseSlotFor, which only removes the
references to the blobs [the `topologyBlobs` keys are still there].
4. Once the container is empty, Slot.java calls cleanupCurrentContainer,
which invokes AsyncLocalizer#releaseSlotFor to release the slot.
5. AsyncLocalizer has another timer task, updateBlobs, which updates the
base/user blobs every 30 seconds based on the keys in
AsyncLocalizer#`topologyBlobs`.
6. Stale keys in AsyncLocalizer#`topologyBlobs` are removed only by
AsyncLocalizer#cleanUp, which is also a timer task.
7. So after we kill a topology, AsyncLocalizer#updateBlobs may try to update
a jar/code/conf blob key that has already been removed. It throws an
exception and retries until the configured maximum number of attempts is
reached.
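The sequence above can be illustrated with a minimal, single-threaded sketch.
It is not the actual AsyncLocalizer code: the class, the reference-count map,
the `remoteBlobs` set and the nested exception are all hypothetical stand-ins,
and only the method names releaseSlotFor and updateBlobs are borrowed from the
description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StaleBlobKeySketch {
    // Hypothetical stand-in for the blob store on the master.
    static final Set<String> remoteBlobs = new HashSet<>();
    // Hypothetical stand-in for topologyBlobs: blob key -> reference count.
    static final Map<String, Integer> topologyBlobs = new HashMap<>();

    // Local stand-in for Storm's exception of the same name.
    static class KeyNotFoundException extends Exception {
        KeyNotFoundException(String key) {
            super("blob key not found: " + key);
        }
    }

    // Old behaviour as described: only the reference is dropped, the key stays.
    static void releaseSlotFor(String key) {
        topologyBlobs.computeIfPresent(key, (k, refs) -> Math.max(0, refs - 1));
    }

    // Timer task: walks every tracked key, including stale ones.
    static void updateBlobs() throws KeyNotFoundException {
        for (String key : topologyBlobs.keySet()) {
            if (!remoteBlobs.contains(key)) {
                throw new KeyNotFoundException(key);
            }
        }
    }

    // Replays the kill sequence and reports whether the update failed.
    public static boolean updateFailsAfterKill() {
        remoteBlobs.clear();
        topologyBlobs.clear();
        String key = "topo-1-stormjar.jar"; // illustrative key name
        remoteBlobs.add(key);
        topologyBlobs.put(key, 1);

        releaseSlotFor(key);     // slot released, key still tracked
        remoteBlobs.remove(key); // master deletes the blob on kill
        try {
            updateBlobs();       // runs before cleanUp() gets a chance
            return false;
        } catch (KeyNotFoundException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("update failed on stale key: " + updateFailsAfterKill());
    }
}
```

In the real supervisor the two timer tasks run concurrently, so the window
between the release and the next cleanUp() is where the exception shows up.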
Here is how I fixed it:
1. Remove the base blob keys eagerly in AsyncLocalizer#releaseSlotFor when
nothing references the blobs any more, so that the stale keys are removed
from AsyncLocalizer#`topologyBlobs` right away.
2. Guard AsyncLocalizer#updateBlobs and AsyncLocalizer#releaseSlotFor with
the same lock.
3. When the container is empty, we no longer need to call
AsyncLocalizer#releaseSlotFor [because we have already deleted them].
4. I also added a new RPC API that tells whether a remote blob exists. We can
use it to decide whether a blob can be removed, instead of calling
getBlobMeta and catching a confusing KeyNotFoundException [logged on both the
supervisors and the master for every base blob].
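A rough sketch of fixes 1, 2 and 4 follows. It is a simplification under
stated assumptions, not the code in this pull request: the class, its fields,
requestDownload, isTracked and the `remoteBlobExists` helper are hypothetical,
and the in-memory set stands in for the remote existence check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class EagerReleaseSketch {
    // One lock shared by releaseSlotFor and updateBlobs (fix 2).
    private final Object lock = new Object();
    // Hypothetical stand-in for topologyBlobs: blob key -> reference count.
    private final Map<String, Integer> topologyBlobs = new HashMap<>();
    // Hypothetical stand-in for the blob store on the master.
    private final Set<String> remoteBlobs;

    public EagerReleaseSketch(Set<String> remoteBlobs) {
        this.remoteBlobs = remoteBlobs;
    }

    // Hypothetical helper modelling an existence check (fix 4):
    // a boolean answer instead of a caught KeyNotFoundException.
    private boolean remoteBlobExists(String key) {
        return remoteBlobs.contains(key);
    }

    // Registers one more user of the blob.
    public void requestDownload(String key) {
        synchronized (lock) {
            topologyBlobs.merge(key, 1, Integer::sum);
        }
    }

    // Drops one reference and removes the key eagerly when it was the last (fix 1).
    public void releaseSlotFor(String key) {
        synchronized (lock) {
            Integer refs = topologyBlobs.get(key);
            if (refs == null) {
                return;
            }
            if (refs <= 1) {
                topologyBlobs.remove(key);
            } else {
                topologyBlobs.put(key, refs - 1);
            }
        }
    }

    // Timer task: returns how many tracked keys still exist remotely.
    public int updateBlobs() {
        synchronized (lock) {
            int updated = 0;
            for (String key : topologyBlobs.keySet()) {
                if (remoteBlobExists(key)) {
                    updated++; // the real task would refresh the local copy here
                }
            }
            return updated;
        }
    }

    public boolean isTracked(String key) {
        synchronized (lock) {
            return topologyBlobs.containsKey(key);
        }
    }

    public static void main(String[] args) {
        Set<String> remote = new java.util.HashSet<>();
        remote.add("topo-1-stormcode.ser"); // illustrative key name
        EagerReleaseSketch localizer = new EagerReleaseSketch(remote);
        localizer.requestDownload("topo-1-stormcode.ser");
        localizer.releaseSlotFor("topo-1-stormcode.ser");
        System.out.println("still tracked: " + localizer.isTracked("topo-1-stormcode.ser"));
    }
}
```

Because the key is removed under the same lock that updateBlobs takes, the
update task either sees the key while it is still referenced or does not see
it at all.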
This is the JIRA:
[STORM-2905](https://issues.apache.org/jira/browse/STORM-2905)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/danny0405/storm fix-kill
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/2618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2618
----
commit 7435dfd3cdb097097b69a96f3007a2c3a106fab4
Author: chenyuzhao <chenyuzhao@...>
Date: 2018-03-31T07:15:28Z
[STORM-2905] Fix KeyNotFoundException when kill a storm and add
isRemoteBlobExists decide for blobs
----