Ethanlm opened a new pull request #3299:
URL: https://github.com/apache/storm/pull/3299


   ## What is the purpose of the change
   
   Details explained in https://issues.apache.org/jira/browse/STORM-3658.
   
   In short, the problem here is 
   1. there could be a race condition between registering the shutdown hooks 
and invoking shutdown hooks
   2. There is a possible deadlock between a thread like executor thread that 
invokes shutdown hook (when it invokes Runtime.getRuntime().exit) and the 
shutdown hook/thread itself that tries to terminate threads like the executor 
thread.
   
   ## How was the change tested
   
   1. The deadlock and race condition can be reproduced with a 
WordCountTopology whose WordCountBolt simply throws a RuntimeException in 
`prepare` method.  Use it to validate the change. The race condition and 
deadlock is not happening. 
   
   The log below shows shutdown hook waited for 100ms and continue. This avoids 
the deadlock.
   ```
   2020-06-29 20:33:44.237 o.a.s.e.ExecutorShutdown ShutdownHook-shutdownFunc 
[WARN] Thread Thread-15-count-executor[1, 1] is still alive (100 ms after 
interruption). Stop wait
   ing for it.
   ```
   
   2. The shutdown hooks are registered at very early stage (before any other 
threads are spawned). This is validated by looking at the logs and see `Adding 
shutdown hook with kill in 3 secs` message appears very early. And I added a 
bug in my test to print out thread dump right after registering the shutdown 
hook. The only alive threads at that point are
   ```
   Log4j2-TF-6-Scheduled-1
   Reference Handler
   Signal Dispatcher
   Finalizer
   main
   ```
   And also tested with the above topology and don't see race condition happen 
again (while it was very easy to happen before the code change)
   
   3. thread name change can be validated by the log
   ```
   2020-06-29 20:33:44.023 o.a.s.u.Utils ShutdownHook-sleepKill-3s [INFO] 
Halting after 3 seconds
   ....
   2020-06-29 20:33:44.040 o.a.s.d.w.Worker ShutdownHook-shutdownFunc [INFO] 
Shutting down worker wc4-15-1593462538 
513216e5-17e1-421f-809e-47cefd6eb136-10.215.73.209 6702
   ...
   ```
   
   4. `workerstate` could be null if exception happens before `workerstate` 
object is initialized properly. This was tested by throwing an exception before 
`workerstate` construction is all done. And the modified shutdown method 
doesn't throw NPE.
   
   5. I noticed that logs like 
   ```
   Trigger any worker shutdown hooks
   Disconnecting from storm cluster state context
   Shut down worker
   ```
   don't show up in the worker log file with or without this code change, with 
a good or bad topology. That should be taken care of separately.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to