Ethan Li created STORM-2987:
-------------------------------

             Summary: PaceMakerStateStorage should deal with 
InterruptedException correctly
                 Key: STORM-2987
                 URL: https://issues.apache.org/jira/browse/STORM-2987
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Ethan Li
            Assignee: Ethan Li


We found an issue that when nimbus restarted, it can only get leadership after 
a few seconds (15~20s). 

 
{code:java}
2018-02-27 08:18:43.420 main o.a.s.z.LeaderElectorImp [INFO] Queued up for 
leader lock.
2018-02-27 08:18:43.481 main o.a.s.d.m.MetricsUtils [INFO] Using statistics 
reporter plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
2018-02-27 08:18:43.483 main o.a.s.d.m.r.JmxPreparableReporter [INFO] 
Preparing...
2018-02-27 08:18:43.499 main o.a.s.m.StormMetricsRegistry [INFO] Started 
statistics report plugin...
2018-02-27 08:18:43.543 main o.a.s.m.n.Login [INFO] successfully logged in.
2018-02-27 08:18:43.551 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator
2018-02-27 08:18:43.551 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
2018-02-27 08:18:43.552 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh thread 
started.
2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT valid starting 
at:        Tue Feb 27 08:18:43 UTC 2018
2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT expires:         
         Wed Feb 28 08:18:43 UTC 2018
2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh sleeping 
until: Wed Feb 28 04:35:55 UTC 2018
2018-02-27 08:18:43.553 main o.a.z.ZooKeeper [INFO] Initiating client 
connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181 
sessionTimeout=60000 watcher=org.apache.
curator.ConnectionState@2e185cd7
2018-02-27 08:18:43.559 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default 
schema
2018-02-27 08:18:43.560 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) 
o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism.
2018-02-27 08:18:43.561 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn 
[INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co
m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context 
section 'Client'
2018-02-27 08:18:43.562 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn 
[INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1
0.215.68.156:2181, initiating session
2018-02-27 08:18:43.565 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn 
[INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah
oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae970099, negotiated timeout = 
40000
2018-02-27 08:18:43.565 main-EventThread o.a.c.f.s.ConnectionStateManager 
[INFO] State change: CONNECTED
2018-02-27 08:18:43.605 Curator-Framework-0 o.a.c.f.i.CuratorFrameworkImpl 
[INFO] backgroundOperationsLoop exiting
2018-02-27 08:18:43.625 main o.a.z.ZooKeeper [INFO] Session: 0x161d5f1ae970099 
closed
2018-02-27 08:18:43.625 main-EventThread o.a.z.ClientCnxn [INFO] EventThread 
shut down
2018-02-27 08:18:43.626 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator
2018-02-27 08:18:43.626 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
2018-02-27 08:18:43.635 main o.a.z.ZooKeeper [INFO] Initiating client 
connection, 
connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181/storm_ystormQE_CI 
sessionTimeout=60000 w
atcher=org.apache.curator.ConnectionState@46cc127b
2018-02-27 08:18:43.654 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) 
o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism.
2018-02-27 08:18:43.660 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn 
[INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co
m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context 
section 'Client'
2018-02-27 08:18:43.663 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default 
schema
2018-02-27 08:18:43.663 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn 
[INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1
0.215.68.156:2181, initiating session
2018-02-27 08:18:43.666 
main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn 
[INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah
oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae97009a, negotiated timeout = 
40000
2018-02-27 08:18:43.669 main-EventThread o.a.c.f.s.ConnectionStateManager 
[INFO] State change: CONNECTED
2018-02-27 08:18:43.790 main o.a.s.d.n.Nimbus [INFO] Starting nimbus server for 
storm version '2.0.0.y'
2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping 
assignments
2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping 
cleanup
2018-02-27 08:18:44.300 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: blob-5-1518767144-stormcode.ser
2018-02-27 08:18:44.301 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormcode.ser
2018-02-27 08:18:44.302 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: logviewer-ui-groups-test-1-1518940914-stormcode.ser
2018-02-27 08:18:44.303 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: blob-5-1518800831-stormcode.ser
2018-02-27 08:18:44.304 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: blob-5-1518767144-stormconf.ser
2018-02-27 08:18:44.306 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: logviewer-ui-groups-test-1-1518940914-stormconf.ser
2018-02-27 08:18:44.307 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormconf.ser
2018-02-27 08:18:44.308 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download 
the blob with key: blob-5-1518800831-stormconf.ser
2018-02-27 08:18:44.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping 
credential renewal.
2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping 
assignments
2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping 
cleanup
2018-02-27 08:18:59.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping 
credential renewal.
2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] 
active-topology-blobs [] local-topology-blobs [] diff-topology-blobs []
2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] 
active-topology-dependencies [] local-blobs [] diff-topology-dependencies []
2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] Accepting 
leadership, all active topologies and corresponding dependencies found locally.
2018-02-27 08:19:04.754 timer o.a.s.d.n.Nimbus [INFO] Scheduling took 442 ms 
for 0 topologies
{code}
 

This can be re-produced by the following steps:
{code:java}
1. restart pacemaker;
2. before pacemaker is up, restart nimbus{code}
So when we restart nimbus process,  it runs ShutDownHooks and stuck on 
timer.close(). 

https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4227

The timer is not able to close because it's waiting for  doCleanup() to stop. 
However, the interruptedException is caught and ate in : 
[https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/pacemaker/PacemakerClient.java#L180-L192]

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to