[ https://issues.apache.org/jira/browse/HIVE-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353569#comment-14353569 ]

Sushanth Sowmyan commented on HIVE-1161:
----------------------------------------

Hi,

Along with [~alangates], I've actually been working on this for the last 
couple of months, together with some engineers working on Falcon 
( [~bvellanki], [~sowmyaramesh] & [~peeyushb] ), and we have a proof of 
concept working between us that I'm in the process of uploading as patches to 
Apache. The original umbrella jira I filed for this was 
https://issues.apache.org/jira/browse/HIVE-7973 , but that info has become a 
bit outdated since. Over the next couple of weeks, I'll file jiras under it 
for all the additional work we've done.

The basic idea behind that replication scheme is very similar to the one 
detailed above: we have a notion of Events that get triggered each time any 
metadata action happens (the code for this is already checked in to Apache 
trunk; see HIVE-9174, HIVE-9321, HIVE-9175, HIVE-9184, HIVE-9271, HIVE-9273, 
HIVE-9501, HIVE-9550 and HIVE-9359).
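To make the idea concrete, each metadata action yields an id-stamped event in a 
monotonically increasing sequence. A minimal model of that (the class and field 
names here are illustrative, not the actual classes from those jiras) would be:

```java
// Minimal model of an id-ordered metastore notification event.
// Names are illustrative only, not Hive's actual event classes.
public class NotificationEvent {
    public final long eventId;     // monotonically increasing sequence number
    public final String eventType; // e.g. "CREATE_TABLE", "DROP_PARTITION"
    public final String dbName;
    public final String tableName;

    public NotificationEvent(long eventId, String eventType,
                             String dbName, String tableName) {
        this.eventId = eventId;
        this.eventType = eventType;
        this.dbName = dbName;
        this.tableName = tableName;
    }
}
```

The important property is the total order on eventId: it is what lets a 
consumer replay metadata actions in sequence and detect stale replays.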

Once we have our events, we have a library that converts them into 
ReplicationTasks, which encapsulate what needs to be done in the source 
warehouse and the destination warehouse to replicate the event, and which 
paths need distcp-ing.
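As a sketch of that second layer, such a ReplicationTask might bundle the 
commands for each side plus the paths to distcp (again, the names below are 
mine for illustration, not the actual shape in the patches):

```java
import java.util.List;

// Hypothetical shape of a ReplicationTask derived from one event:
// what to run at the source, what to run at the destination, and
// which filesystem paths need distcp-ing between the two warehouses.
public class ReplicationTask {
    public final long eventId;               // the event this task replicates
    public final List<String> srcWhCommands; // e.g. an EXPORT at the source
    public final List<String> dstWhCommands; // e.g. an IMPORT at the destination
    public final List<String> pathsToCopy;   // paths to move with distcp

    public ReplicationTask(long eventId, List<String> srcWhCommands,
                           List<String> dstWhCommands,
                           List<String> pathsToCopy) {
        this.eventId = eventId;
        this.srcWhCommands = srcWhCommands;
        this.dstWhCommands = dstWhCommands;
        this.pathsToCopy = pathsToCopy;
    }
}
```

An orchestrator like Falcon can then execute the source commands, distcp the 
listed paths, and execute the destination commands, in that order.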

Then there are changes to EXPORT & IMPORT to tag additional information into 
the dump marking it as a replication dump, and to keep track of state on the 
destination side, so that older events (if replayed) do not trash newer 
information. In conjunction, a change to DROPs was also necessary to make them 
follow a "DROP IF OLDER" kind of semantic.
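The replay-protection idea can be sketched as follows: the destination records 
the last event id applied per object, and an incoming event (including a DROP) 
is applied only if it is newer than that recorded state. This is a simplified 
sketch of the semantic, not the actual state-tracking code in the patches:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of destination-side replay protection:
// apply an event only if it is newer than what was already applied.
public class ReplState {
    // Last event id applied per object (e.g. per "db.table").
    private final Map<String, Long> lastApplied = new HashMap<>();

    // Returns true and records the event if it is newer than the
    // recorded state; a replayed older event is a no-op (returns false).
    public boolean applyIfNewer(String object, long eventId) {
        long current = lastApplied.getOrDefault(object, -1L);
        if (eventId <= current) {
            return false; // stale replay: must not trash newer information
        }
        lastApplied.put(object, eventId);
        return true;
    }

    // "DROP IF OLDER" semantic: the drop takes effect only when the
    // drop event is newer than the object's recorded state.
    public boolean dropIfOlder(String object, long dropEventId) {
        return applyIfNewer(object, dropEventId);
    }
}
```

With this check in place, events can be re-delivered safely in any order 
without a late-arriving DROP or ALTER clobbering a newer table state.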

Finally, there was a fair amount of integration and testing on Falcon's side 
to make sure the whole system worked end-to-end. So far, our proof of concept 
is working well enough that we're ready to contribute it now.

> Hive Replication
> ----------------
>
>                 Key: HIVE-1161
>                 URL: https://issues.apache.org/jira/browse/HIVE-1161
>             Project: Hive
>          Issue Type: New Feature
>          Components: Contrib
>            Reporter: Edward Capriolo
>            Assignee: SHAILESH PILARE
>            Priority: Minor
>
> Users may want to replicate data between two distinct hadoop clusters or two 
> hive warehouses on the same cluster.
> Users may want to replicate entire catalogs or possibly on a table-by-table 
> basis. Should this process be batch driven or be a full-time running 
> application? What are some practical requirements, and what are the limitations?
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
