[ 
https://issues.apache.org/jira/browse/HBASE-28023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiping lv updated HBASE-28023:
-------------------------------
    Description: 
When running ITBLL, a problem occurs that ultimately results in all region 
servers being suspended.

The following is the ITBLL running command:
{code:java}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList 
-DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic Loop 10 10 
10000000 /tmp/biglinkedlist 100 {code}
The sequence of events is as follows:
 # The RollingBatchSuspendResumeRsAction in ITBLL runs the command "sudo 
-u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' 
-f2 | xargs kill -s SIGSTOP" to suspend the RegionServer process.
 # SIGSTOP pauses the RegionServer process rather than killing it, so the 
process still appears in the ps output.
 # The Action then calls waitForServiceToStop to check whether the suspend 
succeeded. That method runs "sudo -u hbase ps ux | grep proc_regionserver | grep 
-v grep | tr -s ' ' | cut -d ' ' -f2" and waits for the process to disappear.
 # Because the process is only suspended, this check does not match what 
suspendRs did, and it times out with the exception below. ITBLL then never 
resumes the RegionServer, and eventually all RegionServer processes are left 
suspended, so ITBLL fails to run.
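
The mismatch in the list above can be illustrated with a small sketch. This is not HBase code: the class SuspendCheckSketch, its methods, and the sample ps line are hypothetical, and the input format is assumed to be "pid stat args" (for example from "ps -o pid=,stat=,args="). It only shows that a presence-based check still reports a SIGSTOP'ed process as running, while a check on the process state (a STAT value starting with "T" on Linux) would recognize it as suspended.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; names and input format are assumptions, not HBase APIs. */
final class SuspendCheckSketch {

  /** One parsed row of assumed "pid stat args" output. */
  static final class ProcRow {
    final int pid;
    final String stat;
    final String args;

    ProcRow(int pid, String stat, String args) {
      this.pid = pid;
      this.stat = stat;
      this.args = args;
    }
  }

  /** Splits a line into pid, state, and the remaining command line. */
  static ProcRow parse(String line) {
    String[] f = line.trim().split("\\s+", 3);
    return new ProcRow(Integer.parseInt(f[0]), f[1], f.length > 2 ? f[2] : "");
  }

  /** Keeps only rows whose command line mentions proc_regionserver. */
  static List<ProcRow> regionServers(List<String> psLines) {
    return psLines.stream()
        .map(SuspendCheckSketch::parse)
        .filter(r -> r.args.contains("proc_regionserver"))
        .collect(Collectors.toList());
  }

  /** Presence-based check: any matching process counts as still running. */
  public static boolean isRunning(List<String> psLines) {
    return !regionServers(psLines).isEmpty();
  }

  /** State-based check: at least one match, and every match is in state T. */
  public static boolean isSuspended(List<String> psLines) {
    List<ProcRow> rows = regionServers(psLines);
    return !rows.isEmpty() && rows.stream().allMatch(r -> r.stat.startsWith("T"));
  }

  public static void main(String[] args) {
    // Made-up sample line for a RegionServer that has received SIGSTOP.
    List<String> afterSigstop =
        Arrays.asList("4242 Tl java -Dproc_regionserver -Xmx4g");
    System.out.println(isRunning(afterSigstop));   // true: a wait-for-stop loop never ends
    System.out.println(isSuspended(afterSigstop)); // true: a suspend-aware check succeeds
  }
}
```

A check along the lines of isSuspended is one possible direction for a fix; whether the actual fix inspects process state or simply skips the wait after suspendRs is not stated in this report.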

{code:java}
2023-07-21 11:18:23,103 WARN  [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3,16020,1689908619650
        at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
        at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
        at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
        at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
        at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
        at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
 {code}

  was:
When running ITBLL, a problem occurs that ultimately results in all region 
servers being suspended.

The following is the ITBLL running command:
{code:java}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList 
-DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic Loop 10 10 
10000000 /tmp/biglinkedlist 100 {code}
The sequence of events is as follows:
 # The RollingBatchSuspendResumeRsAction in ITBLL runs the command "sudo 
-u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' 
-f2 | xargs kill -s SIGSTOP" to suspend the RegionServer process.
 # SIGSTOP pauses the RegionServer process rather than killing it, so the 
process still appears in the ps output.
 # The Action then calls waitForServiceToStop to check whether the suspend 
succeeded. That method runs "sudo -u hbase ps ux | grep proc_regionserver | grep 
-v grep | tr -s ' ' | cut -d ' ' -f2" and waits for the process to disappear.
 # Because the process is only suspended, this check does not match what 
suspendRs did, and it times out with the exception below. ITBLL then never 
resumes the RegionServer, and eventually all RegionServer processes are left 
suspended, so ITBLL fails to run.

{code:java}
2023-07-21 11:18:23,103 WARN  [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3.c-c25e3e8da545bfd2.cn-hangzhou.emr.aliyuncs.com,16020,1689908619650
        at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
        at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
        at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
        at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
        at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
        at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
 {code}


> ITBLL's RollingBatchSuspendResumeRsAction runs the "suspendRs" method to 
> perform the action, but it inadvertently uses the "waitForRegionServerToStop" 
> method to check if it was executed successfully.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28023
>                 URL: https://issues.apache.org/jira/browse/HBASE-28023
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha-1
>            Reporter: Haiping lv
>            Priority: Major
>
> When running ITBLL, a problem occurs that ultimately results in all region 
> servers being suspended.
> The following is the ITBLL running command:
> {code:java}
> hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList 
> -DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic Loop 10 10 
> 10000000 /tmp/biglinkedlist 100 {code}
> The sequence of events is as follows:
>  # The RollingBatchSuspendResumeRsAction in ITBLL runs the command 
> "sudo -u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | 
> cut -d ' ' -f2 | xargs kill -s SIGSTOP" to suspend the RegionServer 
> process.
>  # SIGSTOP pauses the RegionServer process rather than killing it, so the 
> process still appears in the ps output.
>  # The Action then calls waitForServiceToStop to check whether the suspend 
> succeeded. That method runs "sudo -u hbase ps ux | grep proc_regionserver | 
> grep -v grep | tr -s ' ' | cut -d ' ' -f2" and waits for the process to 
> disappear.
>  # Because the process is only suspended, this check does not match what 
> suspendRs did, and it times out with the exception below. ITBLL then never 
> resumes the RegionServer, and eventually all RegionServer processes are left 
> suspended, so ITBLL fails to run.
> {code:java}
> 2023-07-21 11:18:23,103 WARN  [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3,16020,1689908619650
>         at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
>         at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
>         at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
>         at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
>         at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
>         at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
>         at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
