[
https://issues.apache.org/jira/browse/HBASE-28023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haiping lv updated HBASE-28023:
-------------------------------
Description:
When running ITBLL, a problem occurs that ultimately results in all region
servers being suspended.
The following is the ITBLL running command:
{code:java}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
  -DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic \
  Loop 10 10 10000000 /tmp/biglinkedlist 100
{code}
I have summarized the process as follows:
# The RollingBatchSuspendResumeRsAction in ITBLL executes the command "sudo
-u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' '
-f2 | xargs kill -s SIGSTOP" to suspend the RegionServer process.
# SIGSTOP pauses the RegionServer process rather than killing it, so the
process still exists.
# The Action then calls the waitForServiceToStop method to check whether the
suspend succeeded, using the command "sudo -u hbase ps ux | grep
proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2".
# This check does not match what suspendRs did: a suspended process is still
listed by ps, so the wait times out with the exception in the log below. As a
result ITBLL does not resume the RegionServer process, and eventually all
RegionServer processes are left suspended. ITBLL then fails to run.
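The mismatch described in the steps above can be illustrated with a small, self-contained model. This is only a sketch and not HBase code: the class SuspendCheckMismatch, the ProcState enum, and the isListed and waitForStop methods are hypothetical names, and the model assumes the stop check simply polls until the process is no longer listed.

```java
public class SuspendCheckMismatch {
  /** Hypothetical process states, as a plain "ps | grep" listing would see them. */
  public enum ProcState { RUNNING, STOPPED, GONE }

  /** Models the ps-based check: any process that still exists is listed. */
  public static boolean isListed(ProcState state) {
    return state != ProcState.GONE;
  }

  /**
   * Polls until the process is no longer listed.
   * Returns false when maxPolls is exhausted, which models the timeout.
   */
  public static boolean waitForStop(ProcState state, int maxPolls) {
    for (int i = 0; i < maxPolls; i++) {
      if (!isListed(state)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // A killed process disappears from the listing, so the wait succeeds.
    System.out.println(waitForStop(ProcState.GONE, 5));    // prints "true"
    // A SIGSTOP-ed process is still listed, so the wait can only time out.
    System.out.println(waitForStop(ProcState.STOPPED, 5)); // prints "false"
  }
}
```

In this model a suspended process never satisfies a "wait until it is gone" condition, no matter how long the wait is.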
{code:java}
2023-07-21 11:18:23,103 WARN [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3,16020,1689908619650
    at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
    at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
    at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
    at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
    at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
    at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
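The trace shows suspendRs going through waitForRegionServerToStop. One possible way to make the check match the suspend action is to look at the process state instead of the mere presence of a PID. The sketch below is an illustration only, not the HBase implementation: the class PsStateSketch and the method isSuspended are hypothetical, and it assumes a Linux procps-style "ps -o stat= -p <pid>" whose output starts with "T" for a process stopped by a job-control signal such as SIGSTOP.

```java
public class PsStateSketch {
  /**
   * Returns true if a "ps -o stat=" value denotes a process stopped by a
   * job-control signal (state code 'T' in Linux procps, an assumption here).
   * Empty or null output is treated as "no such process", not as suspended.
   */
  public static boolean isSuspended(String stat) {
    String s = (stat == null) ? "" : stat.trim();
    return !s.isEmpty() && s.charAt(0) == 'T';
  }

  public static void main(String[] args) {
    System.out.println(isSuspended("Tl")); // stopped, multi-threaded: prints "true"
    System.out.println(isSuspended("Sl")); // sleeping, multi-threaded: prints "false"
    System.out.println(isSuspended(""));   // process not found: prints "false"
  }
}
```

A wait loop built on such a predicate would finish once the RegionServer is actually paused, rather than waiting for it to disappear from the process list.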
was:
When running ITBLL, a problem occurs that ultimately results in all region
servers being suspended.
The following is the ITBLL running command:
{code:java}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
  -DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic \
  Loop 10 10 10000000 /tmp/biglinkedlist 100
{code}
I have summarized the process as follows:
# The RollingBatchSuspendResumeRsAction in ITBLL executes the command "sudo
-u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' '
-f2 | xargs kill -s SIGSTOP" to suspend the RegionServer process.
# SIGSTOP pauses the RegionServer process rather than killing it, so the
process still exists.
# The Action then calls the waitForServiceToStop method to check whether the
suspend succeeded, using the command "sudo -u hbase ps ux | grep
proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2".
# This check does not match what suspendRs did: a suspended process is still
listed by ps, so the wait times out with the exception in the log below. As a
result ITBLL does not resume the RegionServer process, and eventually all
RegionServer processes are left suspended. ITBLL then fails to run.
{code:java}
2023-07-21 11:18:23,103 WARN [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3.c-c25e3e8da545bfd2.cn-hangzhou.emr.aliyuncs.com,16020,1689908619650
    at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
    at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
    at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
    at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
    at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
    at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
> ITBLL's RollingBatchSuspendResumeRsAction runs the "suspendRs" method to
> perform the action, but it inadvertently uses the "waitForRegionServerToStop"
> method to check if it was executed successfully.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-28023
> URL: https://issues.apache.org/jira/browse/HBASE-28023
> Project: HBase
> Issue Type: Bug
> Affects Versions: 3.0.0-alpha-1
> Reporter: Haiping lv
> Priority: Major