[jira] [Commented] (IMPALA-9150) Restarting minicluster breaks HBase on CDH GBN 1582079

ASF subversion and git services (Jira) Wed, 13 Nov 2019 14:12:06 -0800


    [ https://issues.apache.org/jira/browse/IMPALA-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973749#comment-16973749 ]


ASF subversion and git services commented on IMPALA-9150:
---------------------------------------------------------

Commit f8c8fa5b454b7c56c8383dd103a6f6b87d231327 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f8c8fa5 ]

IMPALA-9150: Use HBase's stop-hbase.sh script for minicluster

testdata/bin/kill-hbase.sh currently uses the generic
kill-java-service.sh script to kill the region servers,
then the master, and then the zookeeper. Recent versions
of HBase become unusable after performing this type of
shutdown. The master seems to get stuck trying to recover,
even after restarting the minicluster.

The root cause in HBase is unclear, but HBase provides the
stop-hbase.sh script, which does a more graceful shutdown.
This switches testdata/bin/kill-hbase.sh to use this script,
which avoids the recovery problems.
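
As a rough illustration of the graceful path described above (a sketch, not
the actual contents of testdata/bin/kill-hbase.sh): the function name
stop_hbase_gracefully and the HBASE_HOME fallback are assumptions made for
this example, and the HBase install directory is expected as the first
argument. stop-hbase.sh is the script HBase ships in its bin/ directory.

```shell
#!/usr/bin/env bash
# Sketch only: stop HBase through its own stop-hbase.sh instead of killing
# the JVMs one by one. The function name and the HBASE_HOME fallback are
# assumptions made for this example.
stop_hbase_gracefully() {
  local hbase_home="${1:-${HBASE_HOME:-}}"
  local stop_script="${hbase_home}/bin/stop-hbase.sh"

  if [ -z "${hbase_home}" ]; then
    echo "HBase home not given and HBASE_HOME is unset" >&2
    return 2
  fi
  if [ ! -x "${stop_script}" ]; then
    echo "Cannot find executable ${stop_script}" >&2
    return 1
  fi

  # Let HBase coordinate its own shutdown, so the region servers are not
  # torn away from a master that is still running.
  "${stop_script}"
}
```

A caller would run something like stop_hbase_gracefully "$HBASE_HOME" and
treat a non-zero exit status as a failed shutdown.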

Testing:
 - Ran the test-with-docker.py tests (which do a minicluster
   restart). Before the change, the HBase tests timed out due
   to HBase getting stuck recovering. After the change, tests
   ran normally.
 - Added a minicluster restart after dataload so that this
   is tested.

Change-Id: I67283f9098c73c849023af8bfa7af62308bf3ed3
Reviewed-on: http://gerrit.cloudera.org:8080/14697
Reviewed-by: Vihang Karajgaonkar <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Restarting minicluster breaks HBase on CDH GBN 1582079
> ------------------------------------------------------
>
>                 Key: IMPALA-9150
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9150
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 3.4.0
>
>
> On the most recent CDH GBN (1582079), restarting HBase using our normal 
> scripts (testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) results in 
> an unusable HBase. Our testdata/bin/kill-hbase.sh script uses the 
> kill-java-service.sh script:
> {code:java}
> "$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
> {code}
> This kills the region servers before the master. On CDH GBN 1582079, the 
> master then hits an uncaught NullPointerException while handling the crashes:
> {noformat}
> 19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [localhost,16022,1573402351656]
> 19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of 
> localhost,16022,1573402351656 on localhost,16000,1573402349553
> ... same for other region servers ...
> 19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102, 
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
> server=localhost,16022,1573402351656, splitWal=true, meta=false
> ... same for other region servers ...
> 19/11/10 16:40:17 INFO master.SplitLogManager: 
> hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir 
> is empty, no logs to split.
> 19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than 
> or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] 
> in 0ms
> ... more stuff ...
> 19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
> ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, 
> meta=false
> java.lang.NullPointerException
>  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
>  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
>  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
>  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
>  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> {noformat}
> Then, when the master starts up again, it cannot complete initialization:
> {noformat}
> 19/11/10 16:50:58 WARN master.HMaster: 
> hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT 
> online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, 
> server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master 
> startup cannot progress, in holding-pattern until region onlined.
> ... more of this ...
> 19/11/10 16:59:28 WARN master.HMaster: 
> hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT 
> online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, 
> server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master 
> startup cannot progress, in holding-pattern until region onlined.
> 19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete 
> initialization after 900000ms. Please consider submitting a bug report 
> including a thread dump of this process.{noformat}
> This continues for an indefinite amount of time.
> Current workaround: Use HBase's bin/stop-hbase.sh script rather than our 
> testdata/bin/kill-hbase.sh script. I do not see the problem when using that 
> script, as it does a more graceful shutdown. We should look into changing 
> testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.
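
One way to notice the stuck start-up quoted above is to scan the master log
for the two messages that appear in the excerpts. This is a minimal sketch
under assumptions: the function name hbase_master_stuck is hypothetical, and
the location of the master log is not stated in the issue, so the caller has
to supply it.

```shell
#!/usr/bin/env bash
# Sketch only: report whether an HBase master log shows the stuck start-up
# described in this issue. The function name is hypothetical and the log
# path must be supplied by the caller.
hbase_master_stuck() {
  local master_log="$1"
  # Both patterns are copied from the log excerpts quoted in the issue.
  # grep -q exits 0 on a match, 1 on no match, and >1 on error (for
  # example, an unreadable file).
  grep -q -e 'Master startup cannot progress, in holding-pattern' \
          -e 'Master failed to complete initialization' \
          "${master_log}"
}
```

For example, a restart script could call hbase_master_stuck with the path to
the master log and abort early when it returns 0, rather than waiting for the
HBase tests to time out.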



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
