[jira] [Commented] (HBASE-13153) enable bulkload to support replication

Bhupendra Kumar Jain (JIRA) Wed, 02 Sep 2015 06:22:41 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727319#comment-14727319
 ]


Bhupendra Kumar Jain commented on HBASE-13153:
----------------------------------------------

Thanks all for the review and nice comments. 
[~apurtell]
bq. Since cyclic replication topologies are supported today I think we'd need 
that handled for the bulk load case too or it will lead to subtle and not so 
subtle problems for users.
To detect cyclic replication, we will use each HBase cluster's unique 
cluster id (same as WAL replication).
The unique cluster ids of all source HBase clusters will be persisted in the 
peer cluster's ZK under the hfile replication node.

For example, c1->c2->c3->c1 is a cyclic replication topology.
When file f1 is bulk loaded into cluster c1 and then replicated from c1->c2 
and c2->c3, the ZK node data looks like this:
||Cluster||hfile node data||
|c1|f1,{NONE}|
|c2|f1,{c1}|
|c3|f1,{c1,c2}|

When c3 tries to replicate f1 to c1, it will detect the cycle and will not 
process the file further. The unique cluster ids of all sources will be passed 
along in the next replication request.
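
The check described above can be sketched roughly as follows. This is only an illustration of the idea, not code from the patch: the class name BulkLoadCycleSketch and the methods shouldReplicateTo and idsToForward are hypothetical, and it assumes the sending cluster knows the unique id of the peer it is about to ship the hfile to.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of cluster-id based cycle detection; not HBase code. */
public class BulkLoadCycleSketch {

    /**
     * Returns true if the hfile may be shipped to the given peer.
     * sourceClusterIds is the set stored with the hfile node, i.e. the
     * clusters the file has already passed through before reaching us.
     */
    static boolean shouldReplicateTo(String peerClusterId, Set<String> sourceClusterIds) {
        // If the peer is already one of the sources, shipping again would close a cycle.
        return !sourceClusterIds.contains(peerClusterId);
    }

    /** Ids sent with the replication request: all sources so far plus the local cluster. */
    static Set<String> idsToForward(String localClusterId, Set<String> sourceClusterIds) {
        Set<String> forwarded = new LinkedHashSet<>(sourceClusterIds);
        forwarded.add(localClusterId);
        return forwarded;
    }

    public static void main(String[] args) {
        Set<String> c1Node = new LinkedHashSet<>();      // f1 bulk loaded on c1: {NONE}
        Set<String> c2Node = idsToForward("c1", c1Node); // node data on c2: {c1}
        Set<String> c3Node = idsToForward("c2", c2Node); // node data on c3: {c1, c2}

        System.out.println("c1 -> c2: " + shouldReplicateTo("c2", c1Node));
        System.out.println("c2 -> c3: " + shouldReplicateTo("c3", c2Node));
        System.out.println("c3 -> c1: " + shouldReplicateTo("c1", c3Node));
    }
}
```

With the c1->c2->c3->c1 topology from the table, the first two hops are allowed and the c3->c1 hop is rejected because c1 is already in the set of sources for f1.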

bq. A crazy idea: rather than have bulk load tooling produce only HFiles for 
replication, why not HFiles for the local cluster and ready made WALs to queue 
up for replication? Of course that's going to have some drawbacks too but I 
think fewer.
We considered similar ideas initially but did not take this approach, because 
it would lose the benefit of bulk load. If we simulate bulk load hfile 
replication as WAL entries, it effectively becomes many Puts on the peer 
cluster rather than a bulk load. With our approach, the hfile is copied and 
loaded into the peer cluster in the same way as the Complete Bulk Load flow, 
so it keeps the benefits of the bulk load mechanism.


[~tedyu]
bq. Could sequence Id be used so that the HFiles don't need to be written again?
In our view, the sequence id cannot be used to detect the cycle, because the 
sequence ids of an hfile will differ across clusters and do not give any 
hint about the source cluster.

[~ram_krish]
bq. Few things to consider, ensure that if there is block encoding then the 
encoding scheme is same in both the tables. These type of conditions may come 
in the initial pre checks that we may need to add.
This scenario is similar to changing the encoding on a running HBase cluster: 
some hfiles will use encoding X and others encoding Y. Each hfile is aware of 
its own encoding type, and as far as I know this is already handled as part of 
the hfile read path, so replication of hfiles should not have any issue. 
Correct me if I am missing anything.

> enable bulkload to support replication
> --------------------------------------
>
>                 Key: HBASE-13153
>                 URL: https://issues.apache.org/jira/browse/HBASE-13153
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>            Reporter: sunhaitao
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0
>
>         Attachments: HBase Bulk Load Replication.pdf
>
>
> Currently we plan to use the HBase Replication feature to deal with a disaster 
> tolerance scenario. But we encounter an issue: we will use bulkload very 
> frequently, and because bulkload bypasses the write path and does not generate 
> WAL, the data will not be replicated to the backup cluster. It's inappropriate 
> to bulkload twice, on both the active cluster and the backup cluster. So I 
> advise making some modifications to the bulkload feature to enable bulkload 
> to both the active cluster and the backup cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
