Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2618
@danny0405
{{updateBlobs}} does not need to be guarded by a lock. This is what I was
talking about with the code being complex.
{{requestDownloadBaseTopologyBlobs}} is protected by a lock simply because
of this non-thread safe code.
https://github.com/apache/storm/blob/402a371ccdb39ccd7146fe9743e91ca36fee6d15/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L222-L226
Part of this were written prior to the more to java8 so {{computeIfAbsent}}
was not available. Now that it is we could replace it and I believe remove the
lock, but I would want to spend some time to be sure it was not accidentally
protecting something else in there too.
{{requestDownloadTopologyBlobs}} looks like it does not need to be
synchronized at all. It must have been a mistake on my part, but it does look
like it might be providing some protection to a bug in
https://github.com/apache/storm/blob/402a371ccdb39ccd7146fe9743e91ca36fee6d15/storm-server/src/main/java/org/apache/storm/localizer/AsyncLocalizer.java#L191-L195
Which is executing outside of a lock, but looks to not be thread safe.
Declaring {{updateBlobs}} as synchronized does absolutely noting except
make it have conflicts with {{requestDownloadTopologyBlobs}} and
{{requestDownloadBaseTopologyBlobs}}. And if we are able to remove the locks
there, then it will not be an issue at all. {{updateBlobs}} is scheduled using
{{scheduleWithFixedDelay}}
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
The javadocs clearly states that the next execution only starts a fixed
delay after the previous one finished. There will only ever be one copy it
running at a time. Additionally everything it does is already asynchronous so
would be happening on a separate thread. Making it synchronized would just slow
things down.
The having a blob disappear at the wrong time is a race that will always be
in the system and we cannot fix it with synchronization because it is happening
on separate servers. The only thing we can do is to deal with it when it
happens. The way the current code deals with it is to try again later. This
means that a worker that is trying to come up for the first time will not come
up until the blob is fully downloaded, but if we are trying to update the blob
and it has disappeared we will simply keep the older version around until we
don't need it any more. Yes we may log some exceptions while we do it, but
that is the worst thing that will happen.
---