[ https://issues.apache.org/jira/browse/HBASE-28023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-28023.
-------------------------------
    Fix Version/s: 2.5.11
                   2.6.2
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~luoen] for contributing!

> ITBLL's RollingBatchSuspendResumeRsAction runs the "suspendRs" method to
> perform the action, but it inadvertently uses the "waitForRegionServerToStop"
> method to check if it was executed successfully.
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-28023
>                 URL: https://issues.apache.org/jira/browse/HBASE-28023
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha-1, 2.7.0
>            Reporter: Haiping lv
>            Assignee: Haiping lv
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.11, 2.6.2
>
>
> When running ITBLL, a problem occurs that ultimately results in all region
> servers being suspended.
> The following is the ITBLL running command:
> {code:java}
> hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList
> -DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic Loop 10 10
> 10000000 /tmp/biglinkedlist 100
> {code}
> I have summarized the process as follows:
> # The Action RollingBatchSuspendResumeRsAction in ITBLL will execute the
> "sudo -u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' |
> cut -d ' ' -f2 | xargs kill -s SIGSTOP" command to suspend the RegionServer
> process.
> # This command will pause the RegionServer process, rather than kill it.
> # The Action uses the waitForServiceToStop method to check if the execution
> was successful, using the "sudo -u hbase ps ux | grep proc_regionserver |
> grep -v grep | tr -s ' ' | cut -d ' ' -f2" command.
> # The waitForServiceToStop method used to check if the execution was
> successful does not match the suspendRs, causing ITBLL to not resume the
> RegionServer process and ultimately resulting in all RegionServer processes
> being suspended. Therefore, ITBLL fails to run.
> {code:java}
> 2023-07-21 11:18:23,103 WARN [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3,16020,1689908619650
> at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
> at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
> at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
> at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
> at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
> at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
> at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
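
For readers following the report, the mismatch between the suspend action and the "stop" check can be illustrated with a small, self-contained sketch. This is not HBase code: the class SuspendCheckSketch, its in-memory process table, and every method name in it are hypothetical stand-ins for the `ps`/`kill` commands quoted in the issue. It only models the observation from the report that a process paused with SIGSTOP is still listed, so a check that waits for the process to disappear can never succeed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in HBase.
class SuspendCheckSketch {
    /** Simplified process states for the sketch. */
    enum ProcState { RUNNING, SUSPENDED }

    /** In-memory stand-in for the `ps` listing, keyed by pid. */
    private final Map<Integer, ProcState> processTable = new HashMap<>();

    void start(int pid) {
        processTable.put(pid, ProcState.RUNNING);
    }

    /** Models `kill -s SIGSTOP`: the process is paused but remains listed. */
    void sigstop(int pid) {
        processTable.replace(pid, ProcState.SUSPENDED);
    }

    /** Models `kill -s SIGCONT`: the paused process runs again. */
    void sigcont(int pid) {
        processTable.replace(pid, ProcState.RUNNING);
    }

    /** Models a "wait for stop" style check: true only once the pid is no longer listed. */
    boolean looksStopped(int pid) {
        return !processTable.containsKey(pid);
    }

    /** Models a check that actually matches a suspend: the pid is listed and paused. */
    boolean looksSuspended(int pid) {
        return processTable.get(pid) == ProcState.SUSPENDED;
    }

    public static void main(String[] args) {
        SuspendCheckSketch cluster = new SuspendCheckSketch();
        int pid = 4242; // arbitrary example pid
        cluster.start(pid);
        cluster.sigstop(pid);

        // The stop-style check stays false after SIGSTOP, so a caller polling it times out.
        System.out.println("looksStopped after SIGSTOP:   " + cluster.looksStopped(pid));
        // A suspend-aware check reports success, so the caller can go on to resume.
        System.out.println("looksSuspended after SIGSTOP: " + cluster.looksSuspended(pid));

        cluster.sigcont(pid);
        System.out.println("looksSuspended after SIGCONT: " + cluster.looksSuspended(pid));
    }
}
```

In this model, a caller that polls looksStopped after sigstop would wait until its timeout and throw before ever reaching the resume step, which corresponds to the "Timed-out waiting for service to stop" exception in the trace above.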