[ 
https://issues.apache.org/jira/browse/HBASE-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bin Shi updated HBASE-22839:
----------------------------
    Description: 
Problem Statement:

In cross-cluster replication validation, we found that some cells in the 
master (source) cluster and the slave (destination) cluster can have the same 
row key and the same timestamp but different values. This happens when 
mutations with the same row key are submitted in a batch without specifying a 
timestamp, and the same millisecond-granularity timestamp is assigned to all 
of them at the time they are committed to the WAL. 
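
For illustration, here is a minimal sketch of how such cells arise, using the 
standard HBase client API; the table name and values are made up for this 
example:

{code:java}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateTimestampRepro {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      byte[] row = Bytes.toBytes("Row Key 1");
      byte[] cf = Bytes.toBytes("CF0");
      byte[] col = Bytes.toBytes("Column 1");
      // No timestamp is passed to addColumn(), so the region server assigns
      // one at WAL-commit time. All three Puts can be committed within the
      // same millisecond, producing three cells with identical coordinates
      // (row, column, timestamp) but different values.
      List<Put> batch = Arrays.asList(
          new Put(row).addColumn(cf, col, Bytes.toBytes("Value 1")),
          new Put(row).addColumn(cf, col, Bytes.toBytes("Value 2")),
          new Put(row).addColumn(cf, col, Bytes.toBytes("Value 3")));
      table.put(batch);
    }
  }
}
{code}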

When this happens, if major compaction hasn't happened yet and you scan the 
table, you can find cells with the same row key and the same timestamp but 
different values, like the first three rows in the following table:
||Row Key||Column||Timestamp||Value||
|Row Key 1|CF0::Column 1|Timestamp 1|Value 1|
|Row Key 1|CF0::Column 1|Timestamp 1|Value 2|
|Row Key 1|CF0::Column 1|Timestamp 1|Value 3|
|Row Key 2|CF0::Column 1|Timestamp 2|Value 4|
|Row Key 3|CF0::Column 1|Timestamp 4|Value 5|

The ordering of the first three rows is indeterminate in the presence of 
cross-cluster replication. HBase breaks ties between cells with identical 
coordinates (row, column, timestamp) by sequence id, so the last write applied 
wins; on the master that is always Value 3, but on the slave it depends on the 
order in which the replicated edits arrive. So after compaction, the master 
cluster will show "Row Key 1, CF0::Column 1, Timestamp 1" with Value 3, while 
the slave cluster might show any of the three values 1, 2, or 3, which results 
in a data inconsistency between the master and slave clusters.

Root Cause Analysis:

In HBaseInterClusterReplicationEndpoint.createBatches() on branch-1.3, WAL 
entries from the same region can be split into different batches according to 
the replication RPC size limit, and these batches are shipped by 
ReplicationSource concurrently, so the batches for the same region can arrive 
at the sink in the slave cluster, and be applied to the region, in an 
indeterminate order.
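
A simplified sketch of this batching behavior, not the actual branch-1.3 code 
(Entry here is an illustrative stand-in for a WAL entry): batches are cut 
purely by accumulated size, so nothing keeps the edits of one region together 
and nothing orders the batches that hold them:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the size-only batching on branch-1.3; Entry is an
// illustrative stand-in for a WAL entry, not the real class.
public class SizeOnlyBatchingSketch {
  static class Entry {
    final String encodedRegionName;
    final long sizeBytes;
    Entry(String region, long size) {
      this.encodedRegionName = region;
      this.sizeBytes = size;
    }
  }

  // Batches are cut by accumulated size alone. Since each batch is then
  // shipped by its own replication RPC, two batches holding edits of the
  // same region may be applied at the sink in either order.
  static List<List<Entry>> createBatches(List<Entry> entries, long rpcSizeLimit) {
    List<List<Entry>> batches = new ArrayList<>();
    List<Entry> current = new ArrayList<>();
    long currentSize = 0;
    for (Entry e : entries) {
      if (!current.isEmpty() && currentSize + e.sizeBytes > rpcSizeLimit) {
        batches.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(e);
      currentSize += e.sizeBytes;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    // Three edits to one region with a small RPC limit end up split across
    // two batches that are then shipped concurrently.
    List<Entry> entries = new ArrayList<>();
    entries.add(new Entry("region-A", 40));
    entries.add(new Entry("region-A", 40));
    entries.add(new Entry("region-A", 40));
    System.out.println(createBatches(entries, 100).size() + " batches"); // 2 batches
  }
}
{code}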

Solution:

In HBase 3.0.0 and 2.1.0, Serial Replication (HBASE-20046) guarantees that the 
order of pushing logs to slave clusters is the same as the order of the 
requests from clients in the master cluster. It consists mainly of two changes:
 # Recording replication "barriers" in ZooKeeper to synchronize replication 
across the old/failed RS and the new RS, providing strict ordering semantics 
even in the presence of a region move or an RS failure.
 # Making sure the batches within one region are shipped to the slave clusters 
in order.

The second change is exactly what we need, and it is the minimal change that 
fixes the issue in this JIRA.

To fix the issue in this JIRA, we have two options:
 # Cherry-pick HBASE-20046 to branch-1.3. Pros: it also fixes the data 
inconsistency caused by a region move or RS failure, and it helps reduce the 
noise in our cross-cluster replication/backup validation, which is our 
ultimate goal. Cons: the change is big, I'm not sure yet whether it is 
self-contained or has other dependencies that would need to be ported to 
branch-1.3 as well, and we would need more time to validate and stabilize it.
 # Port the minimal change, or make an equivalent change to the second part of 
HBASE-20046, to make sure the batches within one region are shipped to the 
slave clusters in order (see the sketch after this list).
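
One possible shape of that minimal change, sketched under the same 
illustrative types as above and not taken from the HBASE-20046 code itself: 
partition entries by encoded region name so that all edits of one region stay 
in a single lane, ship each lane's batches strictly one after another, and let 
different lanes proceed in parallel:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of option 2, not the HBASE-20046 implementation. Entry is the same
// illustrative stand-in for a WAL entry as in the previous sketch.
public class PerRegionOrderedShippingSketch {
  static class Entry {
    final String encodedRegionName;
    Entry(String region) {
      this.encodedRegionName = region;
    }
  }

  // Group entries by region so the edits of one region always travel in the
  // same lane. Within a lane, batches (still cut by the RPC size limit) must
  // be shipped serially, i.e. the next RPC starts only after the previous
  // one succeeds; lanes for different regions may still ship in parallel.
  static Map<String, List<Entry>> partitionByRegion(List<Entry> entries) {
    Map<String, List<Entry>> lanes = new LinkedHashMap<>();
    for (Entry e : entries) {
      lanes.computeIfAbsent(e.encodedRegionName, k -> new ArrayList<>()).add(e);
    }
    return lanes;
  }

  public static void main(String[] args) {
    List<Entry> entries = Arrays.asList(
        new Entry("region-A"), new Entry("region-B"), new Entry("region-A"));
    partitionByRegion(entries).forEach((region, lane) ->
        System.out.println(region + " ships " + lane.size() + " edit(s) in order"));
  }
}
{code}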

I prefer option 2 because of the cons of option 1. Thoughts? 



> Provide Serial Replication in HBase 1.3 to fix "row keys and timestamps are 
> the same but the values are different in the presence of cross-cluster 
> replication"
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22839
>                 URL: https://issues.apache.org/jira/browse/HBASE-22839
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 1.3.4, 1.3.5
>            Reporter: Bin Shi
>            Priority: Major
>             Fix For: 1.3.4, 1.3.5
>
>


