Joe McDonnell created IMPALA-9150:
-------------------------------------

             Summary: Restarting minicluster breaks HBase on CDH GBN 1582079
                 Key: IMPALA-9150
                 URL: https://issues.apache.org/jira/browse/IMPALA-9150
             Project: IMPALA
          Issue Type: Bug
          Components: Infrastructure
    Affects Versions: Impala 3.4.0
            Reporter: Joe McDonnell


On the most recent CDH GBN (1582079), restarting HBase using our normal scripts 
(testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) results in an unusable 
HBase. Our testdata/bin/kill-hbase.sh script uses the kill-java-service.sh 
script:
{code:bash}
"$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
{code}
This kills the region servers before the master. On CDH GBN 1582079, the master 
then hits a NullPointerException while processing the resulting server crashes:
{noformat}
19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration [localhost,16022,1573402351656]
19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of 
localhost,16022,1573402351656 on localhost,16000,1573402349553
... same for other region servers ...
19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102, 
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
server=localhost,16022,1573402351656, splitWal=true, meta=false
... same for other region servers ...
19/11/10 16:40:17 INFO master.SplitLogManager: 
hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir 
is empty, no logs to split.
19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than or 
equal to) 0 (0 bytes) in 0 log files in 
[hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] in 
0ms
... more stuff ...
19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, 
meta=false
java.lang.NullPointerException
 at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
 at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
 at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
 at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
 at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
 at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
 at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
 at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
 at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
 at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
 at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
{noformat}
Then, when the master starts up again, it cannot complete initialization:
{noformat}
19/11/10 16:50:58 WARN master.HMaster: 
hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; 
state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, 
server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master 
startup cannot progress, in holding-pattern until region onlined.
... more of this ...
19/11/10 16:59:28 WARN master.HMaster: 
hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; 
state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, 
server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master 
startup cannot progress, in holding-pattern until region onlined.
19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete 
initialization after 900000ms. Please consider submitting a bug report 
including a thread dump of this process.{noformat}
This continues for an indefinite amount of time.
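
To check whether a minicluster is in this state, one option is to search the 
HMaster log for the two messages quoted above. This is only a rough sketch: the 
function name is hypothetical and the log path must be supplied by the caller, 
since the log location is not stated here.

```shell
# Hypothetical helper, not part of the existing Impala scripts.
# Succeeds (exit 0) if the given HMaster log contains either of the
# stuck-startup messages quoted in this report; non-zero otherwise.
hbase_master_startup_stuck() {
  local master_log="$1"
  grep -q -e 'Master startup cannot progress' \
          -e 'Master failed to complete initialization' "$master_log"
}
```

For example, `hbase_master_startup_stuck /path/to/hbase-master.log && echo stuck` 
would print "stuck" for a log like the one above.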

Current workaround: Use HBase's bin/stop-hbase.sh script rather than our 
testdata/bin/kill-hbase.sh script. I do not see the problem when using that 
script, as it does a more graceful shutdown. We should look into changing 
testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.
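
A minimal sketch of what that change might look like, under some assumptions: 
the HBase install directory is passed in by the caller (how the minicluster 
locates it is not covered here), the function names are hypothetical, and the 
script falls back to the existing kill-java-service.sh invocation when 
bin/stop-hbase.sh is missing or fails. This is an illustration, not a tested 
patch.

```shell
# Hypothetical helpers sketching a more graceful kill-hbase.sh.

# Print "graceful" if the HBase install dir has an executable
# bin/stop-hbase.sh, otherwise print "kill".
choose_hbase_stop_method() {
  local hbase_home="$1"
  if [ -n "$hbase_home" ] && [ -x "$hbase_home/bin/stop-hbase.sh" ]; then
    echo "graceful"
  else
    echo "kill"
  fi
}

# Stop HBase, preferring HBase's own stop script. Falls back to the current
# behavior (killing the Java processes) if the graceful path is unavailable
# or returns a non-zero status.
#   $1: HBase install dir (assumption: supplied by the caller)
#   $2: directory containing kill-java-service.sh
stop_hbase() {
  local hbase_home="$1"
  local script_dir="$2"
  if [ "$(choose_hbase_stop_method "$hbase_home")" = "graceful" ] &&
     "$hbase_home/bin/stop-hbase.sh"; then
    return 0
  fi
  "$script_dir"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
}
```

Note that the fallback path still kills the region servers before the master, 
so it can hit the same problem; it is kept only so the script does not leave 
processes running when the graceful path is unavailable.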


