[
https://issues.apache.org/jira/browse/HBASE-30241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Liu updated HBASE-30241:
-----------------------------
Description:
h2. Symptom
{{TestMobCloneSnapshotFromClientAfterSplittingRegion}} and the non-MOB
{{TestCloneSnapshotFromClientAfterSplittingRegion}}, which share a base class,
are flaky. {{testCloneSnapshotAfterSplittingRegion}} fails intermittently with:
{code}
org.opentest4j.AssertionFailedError: expected: not <null>
at ...CloneSnapshotFromClientAfterSplittingRegionTestBase.testCloneSnapshotAfterSplittingRegion:99
{code}
h2. Root cause
{{CloneSnapshotFromClientAfterSplittingRegionTestBase.splitRegion()}} splits on
the second row, expecting the snapshot to contain Reference files so that the
cloned table records the parent-to-daughter split lineage (SPLITA/SPLITB) in
meta. This is the behavior that the {{assertNotNull(daughter)}} check added in
HBASE-29111 verifies.
However, since HBASE-26421, when a store file lies entirely on one side of the
split point, the split builds a whole-file HFileLink for the daughter instead
of a Reference. The test data is generated from the MD5 of the wall-clock time
({{SnapshotTestingUtils.loadData}}), so the region being split can end up with
each of its store files lying entirely on one side of the split row. When that
happens the snapshot contains only HFileLinks and no Reference files.
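The Reference-versus-link decision described above can be sketched as a small stand-alone model. This is not HBase code: {{SplitFileKind}}, {{StoreFile}} and {{classify}} are hypothetical names, and the rule that the lower daughter owns rows below the split row while the upper daughter owns rows at or above it is an assumption made for the illustration.

```java
public class SplitFileKind {
    enum Kind { REFERENCE, WHOLE_FILE_LINK }

    /** A store file reduced to its first and last row key (both inclusive). */
    static final class StoreFile {
        final String firstRow;
        final String lastRow;

        StoreFile(String firstRow, String lastRow) {
            this.firstRow = firstRow;
            this.lastRow = lastRow;
        }
    }

    /**
     * Assumed rule: a file that fits entirely inside one daughter is linked
     * whole; a file that straddles the split row needs a Reference.
     */
    static Kind classify(StoreFile file, String splitRow) {
        boolean entirelyBelow = file.lastRow.compareTo(splitRow) < 0;
        boolean entirelyAtOrAbove = file.firstRow.compareTo(splitRow) >= 0;
        return (entirelyBelow || entirelyAtOrAbove) ? Kind.WHOLE_FILE_LINK : Kind.REFERENCE;
    }

    public static void main(String[] args) {
        String splitRow = "b";
        System.out.println(classify(new StoreFile("a", "c"), splitRow)); // REFERENCE
        System.out.println(classify(new StoreFile("a", "a"), splitRow)); // WHOLE_FILE_LINK
        System.out.println(classify(new StoreFile("b", "d"), splitRow)); // WHOLE_FILE_LINK
    }
}
```

If every file in the region classifies as {{WHOLE_FILE_LINK}}, the snapshot taken after the split contains no Reference at all, which is the flaky case.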
{{RestoreSnapshotHelper}} records the parent-to-children mapping only when it
restores a Reference file ({{restoreReferenceFile}}); the HFileLink path does
not. So in the all-HFileLink case the cloned table's split parent has no
SPLITA/SPLITB columns, {{MetaTableAccessor.getDaughterRegions}} returns
{{(null, null)}}, and the assertion fails. Because the outcome depends on
randomly generated keys, the failure is non-deterministic.
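To make the missing-lineage step concrete, here is a minimal model of a restore that records parent-to-children entries only for Reference files. All names ({{RestoreLineageModel}}, {{RestoredFile}}, {{parentToChildren}}, {{daughtersOf}}) are hypothetical stand-ins, not the {{RestoreSnapshotHelper}} or {{MetaTableAccessor}} API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RestoreLineageModel {
    enum Kind { REFERENCE, HFILE_LINK }

    /** One file restored into a daughter region of the cloned table. */
    static final class RestoredFile {
        final String parentRegion;
        final String daughterRegion;
        final Kind kind;

        RestoredFile(String parentRegion, String daughterRegion, Kind kind) {
            this.parentRegion = parentRegion;
            this.daughterRegion = daughterRegion;
            this.kind = kind;
        }
    }

    /** Builds the parent-to-children mapping; only Reference files contribute. */
    static Map<String, List<String>> parentToChildren(List<RestoredFile> files) {
        Map<String, List<String>> mapping = new HashMap<>();
        for (RestoredFile file : files) {
            if (file.kind != Kind.REFERENCE) {
                continue; // the link path records nothing
            }
            List<String> children =
                mapping.computeIfAbsent(file.parentRegion, k -> new ArrayList<>());
            if (!children.contains(file.daughterRegion)) {
                children.add(file.daughterRegion);
            }
        }
        return mapping;
    }

    /** Stand-in for the meta lookup: null when no lineage was recorded. */
    static List<String> daughtersOf(String parent, Map<String, List<String>> mapping) {
        return mapping.get(parent);
    }
}
```

With only {{HFILE_LINK}} entries the mapping stays empty and {{daughtersOf}} returns null, mirroring the null daughter that trips the assertion.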
This is purely a test-side issue. In the all-HFileLink case the daughters link
directly to the snapshot files and do not depend on the cloned parent, so not
recording SPLITA/SPLITB is correct and safe -- there is no data loss. The
product behavior is fine; only the test's assumption that a split always yields
References is wrong.
h2. Fix
Make the test deterministically produce a Reference file: in {{splitRegion()}},
major-compact each region into a single store file spanning the whole region
key range before splitting. The split row then always falls inside an existing
file, which yields a Reference (and the SPLITA/SPLITB lineage the test asserts
on). Automatic compaction is disabled on the region server in this test, so the
region is compacted directly via {{HRegion.compact(true)}} (a synchronous
compaction), and auto-compaction stays disabled afterwards so the post-split
reference files are not compacted away.
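Why compacting first removes the flakiness can be shown with a stand-alone model of key ranges. This is a sketch, not the test code: {{CompactBeforeSplitModel}} and its methods are hypothetical, store files are reduced to their first and last row, and the split row is assumed to belong to the upper daughter.

```java
import java.util.Arrays;
import java.util.List;

public class CompactBeforeSplitModel {
    /** Key range of one store file: {firstRow, lastRow}, both inclusive. */
    static String[] range(String firstRow, String lastRow) {
        return new String[] { firstRow, lastRow };
    }

    /** True when the file has rows on both sides of the split row. */
    static boolean straddles(String[] file, String splitRow) {
        return file[0].compareTo(splitRow) < 0 && file[1].compareTo(splitRow) >= 0;
    }

    /** True when at least one file would need a Reference. */
    static boolean anyReference(List<String[]> files, String splitRow) {
        for (String[] file : files) {
            if (straddles(file, splitRow)) {
                return true;
            }
        }
        return false;
    }

    /** Models a major compaction: all files merge into one spanning min..max. */
    static String[] majorCompact(List<String[]> files) {
        String first = files.get(0)[0];
        String last = files.get(0)[1];
        for (String[] file : files) {
            if (file[0].compareTo(first) < 0) {
                first = file[0];
            }
            if (file[1].compareTo(last) > 0) {
                last = file[1];
            }
        }
        return range(first, last);
    }

    public static void main(String[] args) {
        // Rows a, b, c, d; the first row happens to sit alone in its own file.
        List<String[]> files = Arrays.asList(range("a", "a"), range("b", "d"));
        String splitRow = "b"; // the second row
        System.out.println(anyReference(files, splitRow));            // false
        System.out.println(straddles(majorCompact(files), splitRow)); // true
    }
}
```

In this model the single compacted file starts at the first row and ends at the last, so a split on the second row always lands strictly after its first key and at or before its last key.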
h2. Additional coverage
Making the existing test deterministic means it now only exercises the
Reference path. To cover the complementary all-HFileLink path -- which was
previously only hit by chance -- add
{{CloneSnapshotFromClientAfterSplittingRegionWithLinksTestBase}} (MOB and
non-MOB variants). It writes two store files with disjoint key ranges and
splits between them so each daughter gets a whole-file HFileLink, then asserts:
* the daughters contain only HFileLinks (no Reference files);
* the cloned table contains all rows;
* the cloned split parent records no SPLITA/SPLITB daughters (the complement of HBASE-29111: with whole-file links there is no parent-to-daughter mapping to record);
* the cloned table's data survives deletion of the source table and the snapshot followed by an HFile cleaner run, i.e. the HFileLink back-references keep the archived files alive.
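The last assertion relies on back-references protecting archived files from the cleaner. A rough model of that idea follows; {{ArchiveCleanerModel}} and its methods are hypothetical, and the rule that a file is deletable only when nothing links to it is a simplification assumed for this sketch, not a description of the actual cleaner implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ArchiveCleanerModel {
    /** Archived file name mapped to the holders that still link to it. */
    private final Map<String, Set<String>> backRefs = new HashMap<>();

    /** Records that a holder (table or snapshot) links to an archived file. */
    void addLink(String file, String holder) {
        backRefs.computeIfAbsent(file, k -> new HashSet<>()).add(holder);
    }

    /** Removes every link owned by a deleted table or snapshot. */
    void dropHolder(String holder) {
        for (Set<String> holders : backRefs.values()) {
            holders.remove(holder);
        }
    }

    /** Deletes archived files that have no back-reference left. */
    Set<String> runCleaner() {
        Set<String> deleted = new TreeSet<>();
        Iterator<Map.Entry<String, Set<String>>> it = backRefs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> entry = it.next();
            if (entry.getValue().isEmpty()) {
                deleted.add(entry.getKey());
                it.remove();
            }
        }
        return deleted;
    }

    boolean exists(String file) {
        return backRefs.containsKey(file);
    }
}
```

Dropping the source table and the snapshot while the clone still holds its link leaves the file in place after a cleaner run; only once the clone's link is gone does the model delete it.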
This change is test-only; there is no production code change.
> TestMobCloneSnapshotFromClientAfterSplittingRegion is flaky because a split
> may build whole-file HFileLinks instead of References
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-30241
> URL: https://issues.apache.org/jira/browse/HBASE-30241
> Project: HBase
> Issue Type: Sub-task
> Components: regionserver, test
> Reporter: Xiao Liu
> Assignee: Xiao Liu
> Priority: Major
> Fix For: 2.7.0, 3.0.0-beta-2, 2.5.16, 2.6.7