GitHub user danny0405 opened a pull request: https://github.com/apache/storm/pull/2618
    [STORM-2905] Fix KeyNotFoundException when kill a storm and add isRem…

    When we kill a topology, at the moment its blob files are removed, the Supervisor executor still requests those blob files and gets a KeyNotFoundException.

    I stepped in and found the reason:

    1. We do not guard `topologyBlobs` of AsyncLocalizer with a lock, even though AsyncLocalizer is a singleton on a supervisor node.
    2. We remove the jar/code/conf blob keys of a killed storm from `topologyBlobs` only in a timer task, the cleanUp() method of AsyncLocalizer. The removal condition is: [no one references the blobs] AND [blobs removed by master OR exceeds the max configured size]. The default scheduling interval is 30 seconds.
    3. When we kill a storm on a node [which means that the slot container is empty], AsyncLocalizer calls releaseSlotFor, which only removes the reference on the blobs [the `topologyBlobs` keys are still there].
    4. Then the container is empty, and Slot.java calls cleanupCurrentContainer, which invokes AsyncLocalizer#releaseSlotFor to release the slot.
    5. AsyncLocalizer has a timer task, updateBlobs, which updates base/user blobs every 30 seconds based on AsyncLocalizer#`topologyBlobs`.
    6. The overdue keys in AsyncLocalizer#`topologyBlobs` are only removed by AsyncLocalizer#cleanUp, which is also a timer task.
    7. So when we kill a storm, AsyncLocalizer#updateBlobs tries to update based on an already removed jar/code/conf blob key and throws an exception, then retries until the configured maximum number of retries is reached.

    Here is how I fixed it:

    1. Remove the base blob keys eagerly in AsyncLocalizer#releaseSlotFor when there is no reference [no one uses them] on the blobs, and remove the overdue keys from AsyncLocalizer#`topologyBlobs`.
    2. Guard AsyncLocalizer#updateBlobs and AsyncLocalizer#releaseSlotFor with the same lock.
    3. When the container is empty, we do not need to call AsyncLocalizer#releaseSlotFor [because we have already deleted them].
    4. I also added a new RPC API to decide whether a remote blob exists. We can use it to decide whether the blob can be removed, instead of calling getBlobMeta and catching a confusing KeyNotFoundException [logged on both supervisors and master for every base blob].

    This is the JIRA: [STORM-2905](https://issues.apache.org/jira/browse/STORM-2905)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/danny0405/storm fix-kill

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2618.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2618

----
commit 7435dfd3cdb097097b69a96f3007a2c3a106fab4
Author: chenyuzhao <chenyuzhao@...>
Date:   2018-03-31T07:15:28Z

    [STORM-2905] Fix KeyNotFoundException when kill a storm and add isRemoteBlobExists decide for blobs

----

---
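
For readers who want to see the shape of the fix described above, here is a minimal sketch of the idea, not the code in this pull request. It assumes a much simplified model: `LocalizerSketch`, `RemoteBlobs`, `requestBlob` and `exists` are hypothetical names made up for illustration, and only `topologyBlobs`, `releaseSlotFor` and `updateBlobs` echo names from the description. It shows keys being dropped eagerly once nothing references them, the release and the periodic update sharing one lock, and an existence check used instead of catching an exception.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a remote blob store client; not Storm's real API. */
interface RemoteBlobs {
    boolean exists(String key);
}

/** Simplified, hypothetical model of a localizer that tracks topology blobs. */
class LocalizerSketch {
    // One lock shared by release and update, so they never interleave.
    private final Object lock = new Object();
    // Blob key -> ids of the topologies that currently reference it.
    private final Map<String, Set<String>> topologyBlobs = new HashMap<>();
    private final RemoteBlobs remote;

    LocalizerSketch(RemoteBlobs remote) {
        this.remote = remote;
    }

    /** Records that a topology needs the given blob. */
    void requestBlob(String key, String topologyId) {
        synchronized (lock) {
            topologyBlobs.computeIfAbsent(key, k -> new HashSet<>()).add(topologyId);
        }
    }

    /** Drops the topology's references and eagerly forgets unreferenced keys. */
    void releaseSlotFor(String topologyId) {
        synchronized (lock) {
            topologyBlobs.values().forEach(refs -> refs.remove(topologyId));
            // Eager removal: do not wait for a periodic cleanup task.
            topologyBlobs.values().removeIf(Set::isEmpty);
        }
    }

    /** Periodic update; returns the keys that were found remotely and refreshed. */
    Set<String> updateBlobs() {
        synchronized (lock) {
            Set<String> updated = new HashSet<>();
            for (String key : topologyBlobs.keySet()) {
                // Ask whether the blob exists rather than catching a not-found error.
                if (remote.exists(key)) {
                    updated.add(key);
                }
            }
            return updated;
        }
    }
}
```

In this sketch, once `releaseSlotFor` has run for the last topology that references a key, a later `updateBlobs` call no longer sees that key, so it never asks the remote store for a blob that the master has already deleted.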