[
https://issues.apache.org/jira/browse/HIVE-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353569#comment-14353569
]
Sushanth Sowmyan commented on HIVE-1161:
----------------------------------------
Hi,
I've actually been working on this (along with [~alangates]), together with some
engineers working on Falcon ( [~bvellanki], [~sowmyaramesh] & [~peeyushb] ), for
the last couple of months, and we have a proof of concept working between us
that I'm in the process of uploading as patches to Apache. The original
umbrella jira I filed for this was
https://issues.apache.org/jira/browse/HIVE-7973 , but that info has since become
a bit outdated. Over the next couple of weeks, I'll file jiras under it for all
the additional work we've done.
The basic idea behind that replication scheme is very similar to the one
detailed above - we have a notion of Events that get triggered each time any
metadata action happens (the code for this is already checked in to Apache
trunk: see HIVE-9174, HIVE-9321, HIVE-9175, HIVE-9184, HIVE-9271, HIVE-9273,
HIVE-9501, HIVE-9550, HIVE-9359).
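To make the event notion concrete, here is a minimal sketch of what such a metadata event might carry. This is purely illustrative - the class name and fields are assumptions for this example, not the actual classes the linked jiras added to the metastore:

```java
// Illustrative model of a metastore notification event. The real Hive
// classes live under org.apache.hadoop.hive.metastore and differ in detail.
public class NotificationEventSketch {
    public final long eventId;      // monotonically increasing sequence number
    public final String eventType;  // e.g. "CREATE_TABLE", "DROP_PARTITION"
    public final String dbName;     // database the action touched
    public final String tableName;  // table the action touched (may be null)

    public NotificationEventSketch(long eventId, String eventType,
                                   String dbName, String tableName) {
        this.eventId = eventId;
        this.eventType = eventType;
        this.dbName = dbName;
        this.tableName = tableName;
    }
}
```

The important design point is the monotonically increasing event id: it gives every metadata action a total order, which the destination side can later use to reject stale replays.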
Once we have our events, we have a library that converts them into
ReplicationTasks, which encapsulate what needs to be done in the source
warehouse and in the destination warehouse to replicate the event, and what
needs distcp-ing.
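A rough sketch of that event-to-task mapping is below. All names here (the class, the staging path layout, the command strings) are hypothetical and only illustrate the shape of the conversion, not the actual library's API:

```java
import java.util.List;

// Hypothetical sketch of the event -> ReplicationTask conversion described
// above: each task bundles the source-side work, destination-side work,
// and the data paths that need distcp-ing.
public class ReplicationTaskSketch {
    public final String srcCommand;        // e.g. an EXPORT to run at the source
    public final String dstCommand;        // e.g. an IMPORT/DROP to run at the destination
    public final List<String> distcpPaths; // HDFS paths whose contents must be copied

    public ReplicationTaskSketch(String srcCommand, String dstCommand,
                                 List<String> distcpPaths) {
        this.srcCommand = srcCommand;
        this.dstCommand = dstCommand;
        this.distcpPaths = distcpPaths;
    }

    // Map an event type to the work it implies (greatly simplified).
    public static ReplicationTaskSketch fromEvent(long eventId, String eventType,
                                                  String db, String table) {
        switch (eventType) {
            case "CREATE_TABLE":
            case "INSERT":
                // Assumed staging-directory convention for this sketch.
                String stage = "/apps/repl/" + db + "/" + table + "/" + eventId;
                return new ReplicationTaskSketch(
                    "EXPORT TABLE " + db + "." + table + " TO '" + stage + "'",
                    "IMPORT TABLE " + db + "." + table + " FROM '" + stage + "'",
                    List.of(stage));
            case "DROP_TABLE":
                // Drops need no data movement, only a destination-side command.
                return new ReplicationTaskSketch(
                    null,
                    "DROP TABLE IF EXISTS " + db + "." + table,
                    List.of());
            default:
                return null; // event implies no replication work in this sketch
        }
    }
}
```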
Then, there are changes to EXPORT & IMPORT to tag additional information into
the dump indicating that it is for replication, and to track state on the
destination side, so that older events (if replayed) do not trash newer
information. In conjunction, a change to DROPs was also necessary so that they
follow a "DROP IF OLDER" kind of semantic.
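That replay guard can be sketched as follows - assuming (as this example does, with made-up names) that the destination keeps the last-applied event id per object and only applies an event that is strictly newer:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a "drop if older" replay guard. The destination records the
// last event id applied to each object; an incoming event is applied only
// if it is newer, so a replayed older event cannot trash newer state.
public class ReplayGuardSketch {
    private final Map<String, Long> lastAppliedId = new HashMap<>();

    // objectKey might be "db.table"; returns true if the event should run.
    public boolean shouldApply(String objectKey, long eventId) {
        long last = lastAppliedId.getOrDefault(objectKey, -1L);
        if (eventId <= last) {
            return false; // stale replay: a newer event already reached this object
        }
        lastAppliedId.put(objectKey, eventId);
        return true;
    }
}
```

Under this scheme a DROP carrying event id 3, replayed after an event with id 5 has already been applied to the same table, is simply skipped.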
Ultimately, there was a fair amount of integration and testing on Falcon's side
to make sure the whole system worked end-to-end. So far, our proof of concept
is working well enough to contribute now.
> Hive Replication
> ----------------
>
> Key: HIVE-1161
> URL: https://issues.apache.org/jira/browse/HIVE-1161
> Project: Hive
> Issue Type: New Feature
> Components: Contrib
> Reporter: Edward Capriolo
> Assignee: SHAILESH PILARE
> Priority: Minor
>
> Users may want to replicate data between two distinct hadoop clusters or two
> hive warehouses on the same cluster.
> Users may want to replicate entire catalogs or possibly on a table-by-table
> basis. Should this process be batch-driven or be a full-time running
> application? What are some practical requirements, and what are the limitations?
> Comments?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)