[ 
https://issues.apache.org/jira/browse/HBASE-30230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terrytlu updated HBASE-30230:
-----------------------------
    Affects Version/s: 2.4.5

> truncate_preserve gets stuck when table has overlapping regions with same 
> startKey
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-30230
>                 URL: https://issues.apache.org/jira/browse/HBASE-30230
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.4.5
>            Reporter: terrytlu
>            Priority: Major
>
> Summary
> -------
> When executing `truncate_preserve` on a table that has overlapping regions 
> (regions sharing the same startKey), the procedure gets stuck indefinitely. 
> This occurs because during truncate, regions are cleaned and recreated at the 
> same timestamp, producing regions with identical encodedNames (derived from 
> tableName + startKey + regionId). The duplicate encodedNames cause race 
> conditions in subsequent procedure steps.
>  
> Environment
> -----------
>  - HBase version: 2.4.5
>  - Reproduction confirmed on internal test cluster
>  
> Symptoms
> --------
>  - Multiple truncate procedures stuck in the 
>    `TRUNCATE_TABLE_CREATE_FS_LAYOUT` state
>  - Some truncate procedures stuck in the 
>    `REGION_STATE_TRANSITION_CONFIRM_OPENED` state
>  - Tables show RIT (Regions In Transition) after truncate_preserve
>  - The HMaster log shows a large number of FileNotFoundException errors for 
>    .regioninfo files
>  
> Root Cause
> ----------
> When a table has pre-existing region overlap (multiple regions with the same 
> startKey), `truncate_preserve` deletes old regions and creates new ones 
> simultaneously. Since regionId is based on the creation timestamp, 
> overlapping regions created at the same instant produce identical 
> encodedNames (hash of tableName + startKey + regionId).
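
The collision described above can be illustrated with a minimal sketch. It assumes the encoded name is an MD5 hex digest over "tableName,startKey,regionId"; the class and method names are hypothetical stand-ins that use only the JDK, not HBase's own code. The point is that the end key is not an input, so two overlapping regions with the same startKey and the same creation timestamp cannot be told apart.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the encoded-name derivation; not HBase code.
final class EncodedNameCollision {

    // Assumption: the name is a hash over table name, start key and regionId only.
    static String encodedName(String table, String startKey, long regionId) {
        String regionName = table + "," + startKey + "," + regionId;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(regionName.getBytes(StandardCharsets.UTF_8));
            // Render the 16-byte digest as 32 zero-padded hex characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        long createdAt = 1700000000000L; // both regions recreated in the same millisecond
        // Two overlapping regions: same table, same startKey, different (unused) end keys.
        String first = encodedName("t1", "row100", createdAt);
        String second = encodedName("t1", "row100", createdAt);
        System.out.println("same encodedName: " + first.equals(second));
        // A regionId that differs by one millisecond is enough to separate them.
        String later = encodedName("t1", "row100", createdAt + 1);
        System.out.println("distinct with later regionId: " + !first.equals(later));
    }
}
```
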
>  
> This leads to two failure scenarios:
>  
> *Scenario 1 - Stuck at TRUNCATE_TABLE_CREATE_FS_LAYOUT:*
> Concurrent threads attempt to create region directories for regions with the 
> same encodedName. The race condition causes:
> {noformat}
> truncate_preserve
>  → TRUNCATE_TABLE_CREATE_FS_LAYOUT
>  → Thread A: delete dir → create dir → write .regioninfo → init region
>  → Thread B: delete dir (race!) → destroys Thread A's .regioninfo
>  → Thread A: fails on createRegion() → retry forever (STUCK)
> {noformat}
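
The interleaving in the trace above can be replayed deterministically on a local temp directory. This is only a sketch: the two "threads" are run as sequential steps in the order the trace gives, and the directory name, class name and method names are hypothetical. It uses java.nio.file rather than the HBase filesystem layer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Sequential replay of the Scenario 1 interleaving; not HBase code.
final class RegionDirRaceReplay {

    // Delete a directory tree if it exists (children before parents).
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Returns true when "Thread A" can no longer find the .regioninfo it wrote.
    static boolean firstCreatorLosesRegionInfo() {
        try {
            Path tableDir = Files.createTempDirectory("truncate-race");
            // Both regions map to the same directory because their encodedNames match.
            Path regionDir = tableDir.resolve("sameEncodedName");
            Path regionInfo = regionDir.resolve(".regioninfo");
            try {
                // Thread A: delete dir, create dir, write .regioninfo
                deleteRecursively(regionDir);
                Files.createDirectories(regionDir);
                Files.write(regionInfo, new byte[] {1});
                // Thread B: delete dir (removes Thread A's .regioninfo)
                deleteRecursively(regionDir);
                // Thread A: region init reads .regioninfo and fails
                Files.readAllBytes(regionInfo);
                return false;
            } catch (NoSuchFileException e) {
                return true;
            } finally {
                deleteRecursively(tableDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first creator lost .regioninfo: " + firstCreatorLosesRegionInfo());
    }
}
```
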
>  
> *Scenario 2 - Stuck at REGION_STATE_TRANSITION_CONFIRM_OPENED:*
> If both threads succeed in region initialization (Thread B deletes after 
> Thread A completes init), the procedure advances to ASSIGN_REGIONS. Master 
> sends two assign requests for the same encodedName. The RegionServer opens 
> the region on the first request but ignores the second (treating it as a 
> duplicate). The second sub-procedure waits forever for an RS report that 
> never comes, leaving the region in OPENING state.
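
The mismatch between assign requests and open reports can be shown with a toy model. It assumes, as the description states, that the RegionServer reports only the first open for a given encodedName and silently drops repeats. The class and method names are hypothetical and none of this is RegionServer code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of duplicate assign handling; not HBase code.
final class DuplicateAssignSketch {

    // Counts how many "opened" reports a server would send for the given requests.
    static int openedReports(List<String> assignRequests) {
        Set<String> openRegions = new HashSet<>();
        int reports = 0;
        for (String encodedName : assignRequests) {
            // add() returns false for a name that is already open: treated as a
            // duplicate, so no report goes back to the master.
            if (openRegions.add(encodedName)) {
                reports++;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        // Two sub-procedures, one per overlapping region, share one encodedName.
        List<String> requests = Arrays.asList("abc123", "abc123");
        int reports = openedReports(requests);
        int stillWaiting = requests.size() - reports;
        System.out.println("sub-procedures left waiting: " + stillWaiting);
    }
}
```
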
>  
> Proposed Fix
> ------------
> Region overlaps do occur in production environments (for example, as a 
> result of interrupted split operations). If a user triggers 
> truncate_preserve on such a table, the procedure gets stuck indefinitely. 
> Recovery requires manual intervention with HBCK2 to manipulate metadata. 
> This depends heavily on the operator's understanding of the internal region 
> state, so the repair is both costly and risky.
>  
> I suggest adding a pre-check in `truncate_preserve` to detect region overlaps 
> in the target table. If overlaps are detected, reject the operation with a 
> clear error message instead of proceeding into an unrecoverable stuck state.
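
One possible shape for such a pre-check is sketched below. It is an assumption about how the check could work, not a patch: keys are modelled as Strings (HBase uses byte arrays), an empty key stands for the open start or end of the table, and Range, hasOverlap and checkNoOverlap are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical overlap pre-check; not HBase code.
final class OverlapPreCheck {

    // A region's key range: [startKey, endKey); "" means open-ended.
    static final class Range {
        final String startKey;
        final String endKey;

        Range(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // True if any region starts before the previous region (in startKey order) ends.
    static boolean hasOverlap(List<Range> regions) {
        List<Range> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparing((Range r) -> r.startKey));
        Range prev = null;
        for (Range cur : sorted) {
            if (prev != null
                    && (prev.endKey.isEmpty() || cur.startKey.compareTo(prev.endKey) < 0)) {
                return true; // includes the "same startKey" case from this issue
            }
            prev = cur;
        }
        return false;
    }

    // Reject the operation up front instead of letting the procedure get stuck.
    static void checkNoOverlap(String table, List<Range> regions) {
        if (hasOverlap(regions)) {
            throw new IllegalStateException("Table " + table
                + " has overlapping regions; fix the overlaps before truncate_preserve");
        }
    }

    public static void main(String[] args) {
        List<Range> healthy = Arrays.asList(new Range("", "m"), new Range("m", ""));
        List<Range> broken = Arrays.asList(new Range("a", "m"), new Range("a", "g"));
        System.out.println("healthy overlaps: " + hasOverlap(healthy));
        System.out.println("broken overlaps: " + hasOverlap(broken));
    }
}
```

The sketch fails fast with an exception so that the caller sees the reason immediately. Which exception type and which stage of the procedure would host the check are left open here.
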



--
This message was sent by Atlassian Jira
(v8.20.10#820010)