[ https://issues.apache.org/jira/browse/HBASE-30241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Liu updated HBASE-30241:
-----------------------------
    Environment:     (was: h2. Symptom

{{TestMobCloneSnapshotFromClientAfterSplittingRegion}} (and the non-MOB
{{TestCloneSnapshotFromClientAfterSplittingRegion}}, which share a base class)
are flaky. {{testCloneSnapshotAfterSplittingRegion}} fails intermittently with:

{code}
org.opentest4j.AssertionFailedError: expected: not <null>
at ...CloneSnapshotFromClientAfterSplittingRegionTestBase.testCloneSnapshotAfterSplittingRegion:99
{code}

h2. Root cause

{{CloneSnapshotFromClientAfterSplittingRegionTestBase.splitRegion()}} splits on
the second row, expecting the snapshot to contain Reference files so that the
cloned table records the parent-to-daughter split lineage (SPLITA/SPLITB) in
meta -- the behavior HBASE-29111 added the {{assertNotNull(daughter)}} check for.

However, since HBASE-26421, when a store file lies entirely on one side of the
split point the split builds a whole-file HFileLink for the daughter instead of
a Reference. The test data is generated from the MD5 of the wall-clock time
({{SnapshotTestingUtils.loadData}}), so the region being split can end up with
all of its store files on one side of the split row. When that happens the
snapshot contains only HFileLinks and no Reference files.

{{RestoreSnapshotHelper}} only records the parent-to-children mapping when it
restores a Reference file ({{restoreReferenceFile}}); the HFileLink path does
not. So in the all-HFileLink case the cloned table's split parent has no
SPLITA/SPLITB columns, {{MetaTableAccessor.getDaughterRegions}} returns
{{(null, null)}}, and the assertion fails. Because it depends on randomly
generated keys, the failure is non-deterministic.
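
The decision described above can be modeled in a few lines of plain Java. This
is a simplified sketch, not HBase code: {{SplitModel}}, {{StoreFileRange}},
{{daughterFileKind}} and {{recordsSplitLineage}} are hypothetical names, and row
keys are compared as strings rather than byte arrays. It only illustrates why a
region whose store files all sit on one side of the split row ends up with no
recorded lineage.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of how a split treats store files; not HBase code. */
final class SplitModel {

  enum Kind { REFERENCE, HFILE_LINK }

  /** Hypothetical stand-in for a store file's first and last row key. */
  static final class StoreFileRange {
    final String firstRow;
    final String lastRow;

    StoreFileRange(String firstRow, String lastRow) {
      this.firstRow = firstRow;
      this.lastRow = lastRow;
    }
  }

  /** A file straddling the split row needs References; otherwise one whole-file link suffices. */
  static Kind daughterFileKind(StoreFileRange file, String splitRow) {
    boolean entirelyBelow = file.lastRow.compareTo(splitRow) < 0;
    boolean entirelyAtOrAbove = file.firstRow.compareTo(splitRow) >= 0;
    return (entirelyBelow || entirelyAtOrAbove) ? Kind.HFILE_LINK : Kind.REFERENCE;
  }

  /** Models the restore side: lineage is recorded only if at least one Reference is restored. */
  static boolean recordsSplitLineage(List<StoreFileRange> files, String splitRow) {
    for (StoreFileRange file : files) {
      if (daughterFileKind(file, splitRow) == Kind.REFERENCE) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Two files on opposite sides of split row "h": links only, no lineage.
    List<StoreFileRange> disjoint =
        Arrays.asList(new StoreFileRange("a", "c"), new StoreFileRange("m", "z"));
    System.out.println(recordsSplitLineage(disjoint, "h")); // false

    // One file containing the split row: References, so lineage is recorded.
    List<StoreFileRange> straddling = Arrays.asList(new StoreFileRange("a", "z"));
    System.out.println(recordsSplitLineage(straddling, "h")); // true
  }
}
```

In this model the first case corresponds to the failing runs, where
{{getDaughterRegions}} returns {{(null, null)}}, and the second to the passing
runs.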

This is purely a test-side issue. In the all-HFileLink case the daughters link
directly to the snapshot files and do not depend on the cloned parent, so not
recording SPLITA/SPLITB is correct and safe -- there is no data loss. The
product behavior is fine; only the test's assumption that a split always yields
References is wrong.

h2. Fix

Make the test deterministically produce a Reference file: in {{splitRegion()}},
major-compact each region into a single store file spanning the whole region key
range before splitting. The split row then always falls inside an existing file,
which yields a Reference (and the SPLITA/SPLITB lineage the test asserts on).
Automatic compaction is disabled on the region server in this test, so the
region is compacted directly via {{HRegion.compact(true)}} (a synchronous
compaction), and auto-compaction stays disabled afterwards so the post-split
reference files are not compacted away.
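
Why compacting first makes the outcome deterministic can be shown with a small
plain-Java model. This is a sketch under simplifying assumptions, not the actual
test change: {{CompactThenSplitModel}}, {{Range}}, {{majorCompact}} and
{{splitYieldsReference}} are hypothetical names, row keys are plain strings, and
a major compaction is reduced to "one file covering the smallest to the largest
row".

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of "major-compact, then split"; not HBase code. */
final class CompactThenSplitModel {

  /** Hypothetical stand-in for a store file's first and last row key. */
  static final class Range {
    final String first;
    final String last;

    Range(String first, String last) {
      this.first = first;
      this.last = last;
    }
  }

  /** Models a major compaction: all files collapse into one spanning every row. */
  static Range majorCompact(List<Range> files) {
    String first = files.get(0).first;
    String last = files.get(0).last;
    for (Range f : files) {
      if (f.first.compareTo(first) < 0) first = f.first;
      if (f.last.compareTo(last) > 0) last = f.last;
    }
    return new Range(first, last);
  }

  /** True when the split row falls inside the file, so the split must write References. */
  static boolean splitYieldsReference(Range f, String splitRow) {
    return f.first.compareTo(splitRow) < 0 && f.last.compareTo(splitRow) >= 0;
  }

  public static void main(String[] args) {
    List<Range> files = Arrays.asList(new Range("a", "c"), new Range("m", "z"));
    String splitRow = "h";
    for (Range f : files) {
      System.out.println(splitYieldsReference(f, splitRow)); // false for both files
    }
    System.out.println(splitYieldsReference(majorCompact(files), splitRow)); // true
  }
}
```

In the model, any split row that is greater than the region's first row and not
beyond its last row lands inside the compacted file, which matches splitting on
the second row of a region holding at least two rows.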

h2. Additional coverage

Making the existing test deterministic means it now only exercises the Reference
path. To cover the complementary all-HFileLink path -- which was previously only
hit by chance -- add
{{CloneSnapshotFromClientAfterSplittingRegionWithLinksTestBase}} (MOB and
non-MOB variants). It writes two store files with disjoint key ranges and splits
between them so each daughter gets a whole-file HFileLink, then asserts:
* the daughters contain only HFileLinks (no Reference files);
* the cloned table contains all rows;
* the cloned split parent records no SPLITA/SPLITB daughters (the complement of HBASE-29111: with whole-file links there is no parent-to-daughter mapping to record);
* the cloned table's data survives deletion of the source table and the snapshot followed by an HFile cleaner run, i.e. the HFileLink back-references keep the archived files alive.

This change is test-only; there is no production code change.)

> TestMobCloneSnapshotFromClientAfterSplittingRegion is flaky because a split may build whole-file HFileLinks instead of References
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-30241
>                 URL: https://issues.apache.org/jira/browse/HBASE-30241
>             Project: HBase
>          Issue Type: Sub-task
>          Components: regionserver, test
>            Reporter: Xiao Liu
>            Assignee: Xiao Liu
>            Priority: Major
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.16, 2.6.7
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
