[jira] [Updated] (HBASE-21757) retrying to close a region incorrectly resets its RIT age metric

Sergey Shelukhin (JIRA) Tue, 22 Jan 2019 12:24:15 -0800


     [ https://issues.apache.org/jira/browse/HBASE-21757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sergey Shelukhin updated HBASE-21757:
-------------------------------------
    Description: 
We have a region stuck in RIT forever due to some other bug. It looks like we 
no longer have enough logs to file that one, so for now this issue covers only 
the reporting problem.
Every 10 minutes the master does the typical split-brain retry of the region 
close. I noticed that this retry resets the region's RIT age, so the "oldest 
RIT" metric never grows beyond ~10 minutes even though the region has been 
stuck for days.
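
To make the effect concrete, here is a minimal sketch of how a metric can get 
capped this way. It is a toy model, not HBase code: the RitRecord class, its 
fields and the simulate helper are all hypothetical, and the assumption that 
the age is derived from a "last state update" stamp that every retry refreshes 
is mine, not something confirmed in this report.

```java
/** Toy model of a region-in-transition (RIT) record; NOT HBase's actual classes. */
final class RitAgeSketch {

    // Retry interval taken from the log ("suspend 600secs", timeout=600000).
    static final long RETRY_INTERVAL_MS = 600_000L;

    /** Hypothetical RIT record holding two timestamps in milliseconds. */
    static final class RitRecord {
        final long firstEnteredRitMs; // set once, when the region enters RIT
        long lastStateUpdateMs;       // assumed to be refreshed by every retry

        RitRecord(long nowMs) {
            this.firstEnteredRitMs = nowMs;
            this.lastStateUpdateMs = nowMs;
        }

        /** Assumption: a close retry rewrites the CLOSING state and refreshes the stamp. */
        void retryClose(long nowMs) {
            lastStateUpdateMs = nowMs;
        }

        /** Age measured from the latest state update (the behaviour the report describes). */
        long ageFromLastUpdateMs(long nowMs) {
            return nowMs - lastStateUpdateMs;
        }

        /** Age measured from the moment the region first entered RIT. */
        long ageFromFirstEntryMs(long nowMs) {
            return nowMs - firstEnteredRitMs;
        }
    }

    /**
     * Runs the given number of retries, one per interval, and returns
     * { largest age seen when measured from the last update,
     *   final age when measured from first entry }.
     */
    static long[] simulate(int retries) {
        RitRecord rit = new RitRecord(0L);
        long maxAgeFromLastUpdate = 0L;
        long now = 0L;
        for (int i = 1; i <= retries; i++) {
            now = i * RETRY_INTERVAL_MS;
            // Sample the metric just before the retry fires, when it is at its largest.
            maxAgeFromLastUpdate = Math.max(maxAgeFromLastUpdate, rit.ageFromLastUpdateMs(now));
            rit.retryClose(now);
        }
        return new long[] { maxAgeFromLastUpdate, rit.ageFromFirstEntryMs(now) };
    }

    public static void main(String[] args) {
        long[] ages = simulate(288); // 288 retries at 10 minutes each = 2 days
        System.out.println("max age from last update (ms): " + ages[0]);
        System.out.println("age from first RIT entry (ms): " + ages[1]);
    }
}
```

With 288 retries (two days), the first value stays at 600000 ms while the 
second reaches 172800000 ms, which is the gap between what the metric shows 
and how long the region has really been stuck.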

{noformat}
2019-01-22 10:40:52,993 INFO  [PEWorker-10] assignment.RegionStateStore: pid=1865 updating hbase:meta row=region, regionState=CLOSING, regionLocation=server,17020,1547824687684
2019-01-22 10:40:53,025 WARN  [PEWorker-10] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=29297, ppid=1865, state=RUNNABLE, hasLock=true; org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure for region {ENCODED => region, ...} to server server,17020,1547824687684, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this.
2019-01-22 10:40:53,040 INFO  [PEWorker-10] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=1865, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true; TransitRegionStateProcedure table=table, region=region, REOPEN/MOVE; resume parent processing.
2019-01-22 10:40:53,040 WARN  [PEWorker-7] assignment.TransitRegionStateProcedure: Failed transition, suspend 600secs pid=1865, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=table, region=region, REOPEN/MOVE; rit=CLOSING, location=server,17020,1547824687684; waiting on rectified condition fixed by other Procedure or operator intervention
2019-01-22 10:40:53,040 INFO  [PEWorker-7] procedure2.TimeoutExecutorThread: ADDED pid=1865, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=table, region=region, REOPEN/MOVE; timeout=600000, timestamp=1548183053040
{noformat}
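
As a sanity check on the 10-minute cadence, the last log line can be verified 
with a little arithmetic. The sketch below assumes the log timestamps are in 
UTC-8 (inferred only from the -0800 offset on this email, not stated in the 
log) and that the `timestamp` field is the absolute wake-up time in epoch 
milliseconds. The class and method names are made up for illustration.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Checks that "timestamp" in the log equals log time + timeout (illustrative only). */
final class TimeoutStampCheck {

    /** Converts a local log time at the given offset to an instant and adds the timeout. */
    static Instant wakeUp(LocalDateTime loggedAt, ZoneOffset offset, long timeoutMs) {
        return loggedAt.toInstant(offset).plusMillis(timeoutMs);
    }

    public static void main(String[] args) {
        // 2019-01-22 10:40:53,040 from the TimeoutExecutorThread line (40 ms = 40,000,000 ns).
        LocalDateTime loggedAt = LocalDateTime.of(2019, 1, 22, 10, 40, 53, 40_000_000);
        ZoneOffset assumedOffset = ZoneOffset.ofHours(-8); // assumption, see lead-in
        long timeoutMs = 600_000L;                         // timeout=600000 from the log

        Instant expected = wakeUp(loggedAt, assumedOffset, timeoutMs);
        System.out.println("computed wake-up (epoch ms): " + expected.toEpochMilli());
        System.out.println("logged timestamp (epoch ms): 1548183053040");
    }
}
```

Under those assumptions the computed value matches the logged 
timestamp=1548183053040, i.e. the procedure is scheduled to wake up exactly 
600 seconds after the failed transition.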

 !screenshot-1.png!  

  was:
We have a region stuck in RIT forever -due to some other bug that I will file 
later- ok we don't have enough logs for the other bug anymore it looks like. So 
there's just the reporting issue for now.
Every 10 minutes it does the typical split-brain retry; I noticed that this 
retry resets the region's RIT age, so the "oldest RIT" metric never becomes 
larger than ~10mins even though the region has been stuck for days.

{noformat}
2019-01-22 10:40:52,993 INFO  [PEWorker-10] assignment.RegionStateStore: pid=1865 updating hbase:meta row=region, regionState=CLOSING, regionLocation=server,17020,1547824687684
2019-01-22 10:40:53,025 WARN  [PEWorker-10] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=29297, ppid=1865, state=RUNNABLE, hasLock=true; org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure for region {ENCODED => region, ...} to server server,17020,1547824687684, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this.
2019-01-22 10:40:53,040 INFO  [PEWorker-10] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=1865, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true; TransitRegionStateProcedure table=table, region=region, REOPEN/MOVE; resume parent processing.
2019-01-22 10:40:53,040 WARN  [PEWorker-7] assignment.TransitRegionStateProcedure: Failed transition, suspend 600secs pid=1865, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=table, region=region, REOPEN/MOVE; rit=CLOSING, location=server,17020,1547824687684; waiting on rectified condition fixed by other Procedure or operator intervention
2019-01-22 10:40:53,040 INFO  [PEWorker-7] procedure2.TimeoutExecutorThread: ADDED pid=1865, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=table, region=region, REOPEN/MOVE; timeout=600000, timestamp=1548183053040
{noformat}

 !screenshot-1.png!  


> retrying to close a region incorrectly resets its RIT age metric
> ----------------------------------------------------------------
>
>                 Key: HBASE-21757
>                 URL: https://issues.apache.org/jira/browse/HBASE-21757
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Priority: Major
>         Attachments: screenshot-1.png



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
