[jira] [Commented] (IMPALA-9150) Restarting minicluster breaks HBase on CDH GBN 1582079

ASF subversion and git services (Jira) Fri, 22 Nov 2019 20:48:41 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980663#comment-16980663
 ]


ASF subversion and git services commented on IMPALA-9150:
---------------------------------------------------------

Commit 4c09975c14f624028100e9940526a111897846cb in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4c09975 ]

IMPALA-9165: Add back hard kill to kill-hbase.sh

The fix for IMPALA-9150 changed kill-hbase.sh to use HBase's
stop-hbase.sh script. Around this time, the GVO timeout issues
started. GVO can reuse machines, so we don't know what state
they may be in. If something failed to kill HBase processes,
the next job would need to be able to kill them even without
access to the last run's files / logs.

This restores the original kill logic to kill-hbase.sh, after
trying a graceful shutdown using HBase's stop-hbase.sh script.
The original kill logic doesn't rely on anything from the
filesystem to know about the existence of processes, so it
would handle machine reuse.
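
The two-stage shutdown described above can be sketched roughly as
follows. This is an illustration only, not the contents of
kill-hbase.sh: the function name stop_then_kill, its arguments, and the
10-second default wait are assumptions made for the example. It finds
processes through pgrep/pkill, which read the process table, so it needs
no pid files or logs left behind by an earlier run.

```shell
#!/usr/bin/env bash
# Sketch only: stop_then_kill and its parameters are hypothetical names,
# not taken from Impala's scripts.
#
# Usage: stop_then_kill GRACEFUL_CMD PATTERN [WAIT_SECS]
#   GRACEFUL_CMD  a single command (no arguments) that requests a clean stop
#   PATTERN       extended regex matched against full command lines
#   WAIT_SECS     how long to wait before the hard kill (default 10)
stop_then_kill() {
  local graceful_cmd="$1" pattern="$2" wait_secs="${3:-10}"
  local i

  # Step 1: try the graceful path. A failure here must not stop the cleanup.
  "$graceful_cmd" || echo "graceful stop failed, falling back to kill" >&2

  # Step 2: poll the process table until the processes exit or time runs out.
  for ((i = 0; i < wait_secs; i++)); do
    pgrep -f "$pattern" >/dev/null || return 0
    sleep 1
  done

  # Step 3: hard kill whatever is still running, whoever started it.
  pkill -KILL -f "$pattern" || true
  sleep 1

  # Succeed only if nothing matching the pattern is left.
  ! pgrep -f "$pattern" >/dev/null
}
```

A call might look like
stop_then_kill "$HBASE_HOME/bin/stop-hbase.sh" 'HMaster|HRegionServer|HQuorumPeer' 30,
where HBASE_HOME is assumed to point at the HBase installation and the
class names are the ones the original kill logic targets.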

This also changes our Jenkins test scripts to shut down the
minicluster at the end.
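
One way a test script can guarantee a shutdown step at the end, even
when the tests fail, is an EXIT trap. The sketch below is an assumption
about how that could be done; this message does not say whether the
Jenkins scripts use a trap, and run_with_cleanup is a hypothetical
helper introduced only for the example.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_cleanup is a hypothetical helper, not part of
# Impala's Jenkins scripts.
#
# Usage: run_with_cleanup CLEANUP_CMD COMMAND [ARGS...]
# Runs COMMAND in a subshell, always runs CLEANUP_CMD afterwards, and
# returns COMMAND's exit status.
run_with_cleanup() {
  local cleanup_cmd="$1"
  shift
  (
    # The EXIT trap fires when the subshell ends, whether COMMAND
    # succeeded, failed, or called exit itself.
    trap "$cleanup_cmd" EXIT
    "$@"
    # Exit explicitly so COMMAND's status becomes the subshell's status.
    exit "$?"
  )
}
```

For instance, run_with_cleanup "testdata/bin/kill-all.sh" run_all_tests
would stop the minicluster after the tests finish, where run_all_tests
stands in for whatever command drives the test run.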

Testing:
 - Started with a running minicluster, ran bin/clean.sh,
   then ran testdata/bin/kill-all.sh and verified that the
   java processes were gone

Change-Id: Ie2f0b342bcd1d8abea8ef923adbb54a14518a7a6
Reviewed-on: http://gerrit.cloudera.org:8080/14789
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Restarting minicluster breaks HBase on CDH GBN 1582079
> ------------------------------------------------------
>
>                 Key: IMPALA-9150
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9150
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 3.4.0
>
>
> On the most recent CDH GBN (1582079), restarting HBase using our normal 
> scripts (testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) results in 
> an unusable HBase. Our testdata/bin/kill-hbase.sh script uses the 
> kill-java-service.sh script:
> {code:java}
> "$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
> {code}
> This kills the region servers before the master. On CDH GBN 1582079, the 
> master gets unhappy:
> {noformat}
> 19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [localhost,16022,1573402351656]
> 19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of 
> localhost,16022,1573402351656 on localhost,16000,1573402349553
> ... same for other region servers ...
> 19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102, 
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
> server=localhost,16022,1573402351656, splitWal=true, meta=false
> ... same for other region servers ...
> 19/11/10 16:40:17 INFO master.SplitLogManager: 
> hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir 
> is empty, no logs to split.
> 19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than 
> or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] 
> in 0ms
> ... more stuff ...
> 19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
> ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, 
> meta=false
> 19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
> ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, 
> meta=false
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> {noformat}
> Then, when the master starts up again, it remains unhappy:
> {noformat}
> 19/11/10 16:50:58 WARN master.HMaster: 
> hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT 
> online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, 
> server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master 
> startup cannot progress, in holding-pattern until region onlined.
> ... more of this ...
> 19/11/10 16:59:28 WARN master.HMaster: 
> hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT 
> online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, 
> server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master 
> startup cannot progress, in holding-pattern until region onlined.
> 19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete 
> initialization after 900000ms. Please consider submitting a bug report 
> including a thread dump of this process.{noformat}
> This continues for an indefinite amount of time.
> Current workaround: Use HBase's bin/stop-hbase.sh script rather than our 
> testdata/bin/kill-hbase.sh script. I do not see the problem when using that 
> script, as it does a more graceful shutdown. We should look into changing 
> testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)