[
https://issues.apache.org/jira/browse/HBASE-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465759#comment-16465759
]
Zheng Hu edited comment on HBASE-20475 at 5/7/18 10:46 AM:
-----------------------------------------------------------
I checked the UT and the log again. The observed behavior is:
{code:java}
<!--- testEditsBehindDroppedTableTiming begin
1. add peer
2. restart cluster to keep only one rs, and create table;
3. disable peer;
4. put a row in the test_dropped table;
5. put 1000 rows in the test table;
6. drop the test_dropped table;
7. enable peer;
8. we expect that the last row (rowKey=999) does not exist in the peer cluster,
but the check fails...
<!--- testEditsBehindDroppedTableTiming end
{code}
I think the potential problems are:
1. In HBaseInterClusterReplicationEndpoint, we hash the encoded region name,
divide the entries into batches, and replicate the batches in order with one thread.
It is possible that the batches are grouped as:
{code:java}
batch-1 : [500,...,998, 999]
batch-2 : [row_in_test_dropped_table, 0, 1, 2, 3, ..., 499 ]
{code}
Batch-1 is replicated first, so row=999 reaches the peer cluster even though
the row from the dropped table sits in batch-2.
2. All UTs use the same rowkey range when putting the 1000 rows, without cleaning up, so
one UT may affect another.
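To make point 1 concrete, here is a minimal, self-contained sketch of how hash-based grouping can put row 999 in an earlier batch than the dropped-table row. It is not the HBaseInterClusterReplicationEndpoint code: the class `BatchOrderSketch`, the `Entry` type, the one-letter region names, the use of `String.hashCode()` as the hash, and the "stop at the first batch that touches the dropped region" rule are all assumptions chosen only to reproduce the grouping shown above.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchOrderSketch {
    /** Hypothetical stand-in for a WAL entry: encoded region name plus row key. */
    static final class Entry {
        final String encodedRegionName;
        final String row;

        Entry(String encodedRegionName, String row) {
            this.encodedRegionName = encodedRegionName;
            this.row = row;
        }
    }

    /** Assigns each entry to a batch by hashing its encoded region name (assumed hash). */
    static List<List<Entry>> groupByRegionHash(List<Entry> entries, int numBatches) {
        List<List<Entry>> batches = new ArrayList<>();
        for (int i = 0; i < numBatches; i++) {
            batches.add(new ArrayList<>());
        }
        for (Entry e : entries) {
            int idx = Math.floorMod(e.encodedRegionName.hashCode(), numBatches);
            batches.get(idx).add(e);
        }
        return batches;
    }

    /** Ships batches in index order and stops at the first batch touching the dropped region. */
    static List<String> replicateInOrder(List<List<Entry>> batches, String droppedRegion) {
        List<String> shipped = new ArrayList<>();
        for (List<Entry> batch : batches) {
            for (Entry e : batch) {
                if (e.encodedRegionName.equals(droppedRegion)) {
                    // This batch fails; everything shipped earlier already reached the peer.
                    return shipped;
                }
            }
            for (Entry e : batch) {
                shipped.add(e.row);
            }
        }
        return shipped;
    }

    /** WAL order from the test: one dropped-table row first, then rows 0..999. */
    static List<Entry> sampleEntries() {
        List<Entry> entries = new ArrayList<>();
        // Region "a" (hypothetical) belongs to the dropped table.
        entries.add(new Entry("a", "row_in_test_dropped_table"));
        for (int i = 0; i < 1000; i++) {
            // Hypothetical split of the test table: rows 0..499 in "c", rows 500..999 in "b".
            entries.add(new Entry(i < 500 ? "c" : "b", String.valueOf(i)));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<List<Entry>> batches = groupByRegionHash(sampleEntries(), 2);
        List<String> shipped = replicateInOrder(batches, "a");
        // prints "row 999 reached the peer: true"
        System.out.println("row 999 reached the peer: " + shipped.contains("999"));
    }
}
```

In this sketch the dropped-table row is written first, yet it lands in the second batch because batch membership follows the hash, not the write order. The first batch (rows 500..999) is shipped before the failure is ever hit.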
was (Author: openinx):
I checked the UT and the log again. The observed behavior is:
{code:java}
<!--- testEditsBehindDroppedTableTiming begin
1. add peer
2. restart cluster to keep only one rs, and create table;
3. disable peer;
4. put a row in the test_dropped table;
5. put 1000 rows in the test table;
6. enable peer;
7. we expect that the last row (rowKey=999) does not exist in the peer cluster,
but the check fails...
<!--- testEditsBehindDroppedTableTiming end
{code}
I think the potential problems are:
1. In HBaseInterClusterReplicationEndpoint, we hash the encoded region name,
divide the entries into batches, and replicate the batches in order with one thread.
It is possible that the batches are grouped as:
{code:java}
batch-1 : [500,...,998, 999]
batch-2 : [row_in_test_dropped_table, 0, 1, 2, 3, ..., 499 ]
{code}
Batch-1 is replicated first, so row=999 reaches the peer cluster even though
the row from the dropped table sits in batch-2.
2. All UTs use the same rowkey range when putting the 1000 rows, without cleaning up, so
one UT may affect another.
> Fix the flaky TestReplicationDroppedTables unit test.
> -----------------------------------------------------
>
> Key: HBASE-20475
> URL: https://issues.apache.org/jira/browse/HBASE-20475
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.0
> Reporter: Zheng Hu
> Assignee: Zheng Hu
> Priority: Major
> Fix For: 3.0.0, 2.1.0
>
> Attachments: HBASE-20475-addendum-v2.patch,
> HBASE-20475-addendum-v3.patch, HBASE-20475-addendum.patch, HBASE-20475.patch
>
>
> See
> https://builds.apache.org/job/HBASE-Find-Flaky-Tests/lastSuccessfulBuild/artifact/dashboard.html
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)