GitHub user danny0405 opened a pull request:

    https://github.com/apache/storm/pull/2618

    [STORM-2905] Fix KeyNotFoundException when kill a storm and add isRem…

    When we kill a topology, the Supervisor executor may still request the
    topology's blob files at the moment they are being removed, and it then
    gets a KeyNotFoundException.
    
    I stepped through the code and found the cause:
    1. `topologyBlobs` in AsyncLocalizer, which is a singleton on each
       supervisor node, is not guarded by a lock.
    2. The jar/code/conf blob keys of a killed topology are removed from
       `topologyBlobs` only by a timer task, AsyncLocalizer#cleanUp. The
       removal condition is: [nothing references the blobs] AND [the blobs
       were removed by the master OR the configured max size is exceeded].
       The default scheduling interval is 30 seconds.
    3. When we kill a topology on a node [which means the slot's container
       becomes empty], AsyncLocalizer#releaseSlotFor only removes the
       references on the blobs [the `topologyBlobs` keys are still there].
    4. Once the container is empty, Slot.java calls cleanupCurrentContainer,
       which invokes AsyncLocalizer#releaseSlotFor to release the slot.
    5. AsyncLocalizer has a timer task, updateBlobs, that updates the
       base/user blobs every 30 seconds based on
       AsyncLocalizer#`topologyBlobs`.
    6. Stale keys in AsyncLocalizer#`topologyBlobs` are removed only by
       AsyncLocalizer#cleanUp, which is also a timer task.
    7. So after we kill a topology, AsyncLocalizer#updateBlobs still tries to
       update the already-removed jar/code/conf blob keys, which throws an
       exception, and the update is retried until the configured maximum
       number of retries is reached.
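
The sequence above can be illustrated with a small, self-contained sketch. It is not the AsyncLocalizer code: the class `StaleKeySketch`, its methods, and the key name are hypothetical stand-ins, and the remote blob store is simulated by an in-memory set. It only shows how a key that is released but not removed is still visited by the periodic update after the master has deleted the blob.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the localizer; not the real AsyncLocalizer.
class StaleKeySketch {
    // Keys that still exist on the (simulated) master.
    final Set<String> remoteKeys = ConcurrentHashMap.newKeySet();
    // Local view: blob key -> number of slots referencing it.
    final Map<String, Integer> topologyBlobs = new ConcurrentHashMap<>();

    void addReference(String key) {
        remoteKeys.add(key);
        topologyBlobs.merge(key, 1, Integer::sum);
    }

    // Behaviour described in step 3: the reference is dropped, the key stays.
    void releaseReferenceOnly(String key) {
        topologyBlobs.computeIfPresent(key, (k, refs) -> Math.max(0, refs - 1));
    }

    // Simulates the master deleting the blob when the topology is killed.
    void masterDeletes(String key) {
        remoteKeys.remove(key);
    }

    // Periodic update: returns the tracked keys that no longer exist remotely,
    // i.e. the lookups that would fail with a key-not-found error.
    List<String> updateBlobs() {
        List<String> missing = new ArrayList<>();
        for (String key : topologyBlobs.keySet()) {
            if (!remoteKeys.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        StaleKeySketch localizer = new StaleKeySketch();
        String key = "topo-1-stormjar.jar"; // made-up key name
        localizer.addReference(key);
        localizer.releaseReferenceOnly(key); // slot released, key still tracked
        localizer.masterDeletes(key);        // topology killed on the master
        System.out.println("stale keys seen by updateBlobs: " + localizer.updateBlobs());
    }
}
```

In this sketch the key stays in the map with a reference count of zero until some separate cleanup runs, which is the window in which the update task hits the missing blob.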
    
    Here is how I fixed it:
    1. Remove the base blob keys eagerly in AsyncLocalizer#releaseSlotFor
       when nothing references the blobs any more, and remove the stale keys
       from AsyncLocalizer#`topologyBlobs`.
    2. Guard AsyncLocalizer#updateBlobs and AsyncLocalizer#releaseSlotFor
       with the same lock.
    3. When the container is empty, we no longer need to call
       AsyncLocalizer#releaseSlotFor [because we have already deleted them].
    4. I also added a new RPC API to check whether a remote blob exists. We
       can use it to decide whether a blob can be removed, instead of calling
       getBlobMeta and catching a confusing KeyNotFoundException [which is
       logged on both the supervisors and the master for every base blob].
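
As a rough illustration of points 1, 2 and 4, here is a minimal sketch under stated assumptions. `EagerReleaseSketch`, `trackedKeys` and the `remoteBlobExists` predicate are hypothetical names chosen for this example. They are not the signatures used in the patch, and the predicate only stands in for the new existence-check RPC.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the fix; names do not come from the Storm code base.
class EagerReleaseSketch {
    private final Object lock = new Object();
    // Blob key -> number of slots referencing it; guarded by `lock`.
    private final Map<String, Integer> topologyBlobs = new HashMap<>();
    // Stand-in for an "does this remote blob exist?" call.
    private final Predicate<String> remoteBlobExists;

    EagerReleaseSketch(Predicate<String> remoteBlobExists) {
        this.remoteBlobExists = remoteBlobExists;
    }

    void addReference(String key) {
        synchronized (lock) {
            topologyBlobs.merge(key, 1, Integer::sum);
        }
    }

    // Drops one reference and removes the key as soon as nothing uses it.
    void releaseSlotFor(String key) {
        synchronized (lock) {
            // Returning null from the remapping function removes the entry.
            topologyBlobs.computeIfPresent(key, (k, refs) -> refs > 1 ? refs - 1 : null);
        }
    }

    // Runs under the same lock as releaseSlotFor and skips keys that the
    // remote side no longer has, instead of catching an exception.
    List<String> updateBlobs() {
        synchronized (lock) {
            List<String> refreshed = new ArrayList<>();
            for (String key : topologyBlobs.keySet()) {
                if (remoteBlobExists.test(key)) {
                    refreshed.add(key);
                }
            }
            return refreshed;
        }
    }

    int trackedKeys() {
        synchronized (lock) {
            return topologyBlobs.size();
        }
    }

    public static void main(String[] args) {
        Set<String> remote = new HashSet<>();
        remote.add("topo-1-stormcode.ser"); // made-up key name
        EagerReleaseSketch localizer = new EagerReleaseSketch(remote::contains);
        localizer.addReference("topo-1-stormcode.ser");
        localizer.releaseSlotFor("topo-1-stormcode.ser"); // last reference gone
        System.out.println("tracked keys after release: " + localizer.trackedKeys());
    }
}
```

Because both methods take the same lock, an update can never observe a key halfway through being released, and a key with no remaining references is gone before the next update runs.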
    
    This is the JIRA: 
[STORM-2905](https://issues.apache.org/jira/browse/STORM-2905)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/danny0405/storm fix-kill

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2618.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2618
    
----
commit 7435dfd3cdb097097b69a96f3007a2c3a106fab4
Author: chenyuzhao <chenyuzhao@...>
Date:   2018-03-31T07:15:28Z

    [STORM-2905] Fix KeyNotFoundException when kill a storm and add 
isRemoteBlobExists decide for blobs

----

