Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2618
  
    @danny0405 
    
    {{updateBlobs}} does not need to be guarded by a lock.  This is what I was 
talking about with the code being complex.
    
    {{requestDownloadBaseTopologyBlobs}} is protected by a lock simply because 
of this non-thread safe code.
    
    
https://github.com/apache/storm/blob/402a371ccdb39ccd7146fe9743e91ca36fee6d15/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L222-L226
    
    Part of this were written prior to the more to java8 so {{computeIfAbsent}} 
was not available.  Now that it is we could replace it and I believe remove the 
lock, but I would want to spend some time to be sure it was not accidentally 
protecting something else in there too.
    
    {{requestDownloadTopologyBlobs}} looks like it does not need to be 
synchronized at all.  It must have been a mistake on my part, but it does look 
like it might be providing some protection to a bug in
    
    
https://github.com/apache/storm/blob/402a371ccdb39ccd7146fe9743e91ca36fee6d15/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L191-L195
    
    Which is executing outside of a lock, but looks to not be thread safe.
    
    Declaring {{updateBlobs}} as synchronized does absolutely noting except 
make it have conflicts with {{requestDownloadTopologyBlobs}} and 
{{requestDownloadBaseTopologyBlobs}}.  And if we are able to remove the locks 
there, then it will not be an issue at all.  {{updateBlobs}} is scheduled using 
{{scheduleWithFixedDelay}}
    
    
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
    
    The javadocs clearly states that the next execution only starts a fixed 
delay after the previous one finished.  There will only ever be one copy it 
running at a time.  Additionally everything it does is already asynchronous so 
would be happening on a separate thread. Making it synchronized would just slow 
things down.  
    
    The having a blob disappear at the wrong time is a race that will always be 
in the system and we cannot fix it with synchronization because it is happening 
on separate servers.  The only thing we can do is to deal with it when it 
happens.  The way the current code deals with it is to try again later.  This 
means that a worker that is trying to come up for the first time will not come 
up until the blob is fully downloaded, but if we are trying to update the blob 
and it has disappeared we will simply keep the older version around until we 
don't need it any more.  Yes we may log some exceptions while we do it, but 
that is the worst thing that will happen.


---

Reply via email to