Joe McDonnell created IMPALA-9150:
-------------------------------------
Summary: Restarting minicluster breaks HBase on CDH GBN 1582079
Key: IMPALA-9150
URL: https://issues.apache.org/jira/browse/IMPALA-9150
Project: IMPALA
Issue Type: Bug
Components: Infrastructure
Affects Versions: Impala 3.4.0
Reporter: Joe McDonnell
On the most recent CDH GBN (1582079), restarting HBase using our normal scripts
(testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) results in an unusable
HBase. Our testdata/bin/kill-hbase.sh script uses the kill-java-service.sh
script:
{code:java}
"$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
{code}
This kills the region servers before the master. On CDH GBN 1582079, the master
then hits a NullPointerException while processing the resulting
ServerCrashProcedure:
{noformat}
19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral node
deleted, processing expiration [localhost,16022,1573402351656]
19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of
localhost,16022,1573402351656 on localhost,16000,1573402349553
... same for other region servers ...
19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=localhost,16022,1573402351656, splitWal=true, meta=false
... same for other region servers ...
19/11/10 16:40:17 INFO master.SplitLogManager:
hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir
is empty, no logs to split.
19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than or
equal to) 0 (0 bytes) in 0 log files in
[hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] in
0ms
... more stuff ...
19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught
runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true;
ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true,
meta=false
19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught
runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true;
ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true,
meta=false
java.lang.NullPointerException
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
{noformat}
Then, when the master starts up again, it never finishes initializing because
the hbase:namespace region stays offline:
{noformat}
19/11/10 16:50:58 WARN master.HMaster:
hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online;
state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428,
server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master
startup cannot progress, in holding-pattern until region onlined.
... more of this ...
19/11/10 16:59:28 WARN master.HMaster:
hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online;
state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428,
server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master
startup cannot progress, in holding-pattern until region onlined.
19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete
initialization after 900000ms. Please consider submitting a bug report
including a thread dump of this process.
{noformat}
This continues indefinitely.
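For scripts that want to fail fast instead of waiting on a master in this state, a rough sketch of a log check follows. The function name hbase_master_stuck is hypothetical, and it assumes each log entry sits on a single line in the real HMaster log (the excerpts above are wrapped by the mail client); the two search strings are taken from the messages quoted above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the given HMaster log contains either of
# the stuck-startup messages quoted in this report, non-zero otherwise.
hbase_master_stuck() {
  local log_file="$1"
  grep -q \
    -e 'Master startup cannot progress' \
    -e 'Master failed to complete initialization' \
    "$log_file"
}
```

A caller could poll this after run-hbase.sh and abort early rather than waiting for the 900000ms initialization timeout.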
Current workaround: Use HBase's bin/stop-hbase.sh script rather than our
testdata/bin/kill-hbase.sh script. I do not see the problem when using that
script, as it does a more graceful shutdown. We should look into changing
testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.
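A minimal sketch of what that change might look like, not a tested patch. It assumes HBASE_HOME points at the HBase install and DIR at the directory holding kill-java-service.sh. The function name stop_hbase_gracefully and the fall-back-to-forceful-kill policy are illustrative assumptions, not existing code.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: prefer HBase's own graceful stop script, and fall back
# to the existing forceful kill only if the graceful stop is missing or fails.
# HBASE_HOME and DIR are assumed to be set by the calling environment.
stop_hbase_gracefully() {
  local stop_script="${HBASE_HOME:-}/bin/stop-hbase.sh"
  local kill_script="${DIR:-.}/kill-java-service.sh"

  # Graceful path: let HBase shut down the master and region servers itself.
  if [[ -x "$stop_script" ]] && "$stop_script"; then
    echo "graceful"
    return 0
  fi

  # Fallback: the same forceful kill that kill-hbase.sh does today.
  "$kill_script" -c HRegionServer -c HMaster -c HQuorumPeer -s 2 || return 1
  echo "forced"
}
```

Keeping the forceful kill as a fallback means a hung HBase can still be cleaned up, while the normal restart path avoids killing region servers out from under a live master.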