[ https://issues.apache.org/jira/browse/HIVE-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353569#comment-14353569 ]
Sushanth Sowmyan commented on HIVE-1161:
----------------------------------------

Hi, I've (along with [~alangates]) actually been working on this, together with some engineers ([~bvellanki], [~sowmyaramesh] & [~peeyushb]) working on Falcon, for the last couple of months, and we have a proof of concept working that I'm in the process of uploading as patches to Apache. The original umbrella jira I filed for this was https://issues.apache.org/jira/browse/HIVE-7973, but that info is a bit outdated now. Over the next couple of weeks, I'll file jiras under it for all the additional work we've done.

The basic idea behind the replication scheme is very similar to the one detailed above: we have a notion of Events that get triggered each time any metadata action happens (code for this is already checked in to Apache trunk; see HIVE-9174, HIVE-9321, HIVE-9175, HIVE-9184, HIVE-9271, HIVE-9273, HIVE-9501, HIVE-9550, HIVE-9359).

Once we have our events, we have a library that converts them into ReplicationTasks, which encapsulate what needs to be done in the source warehouse and in the destination warehouse to replicate the event, and what needs distcp-ing. Then, there are changes to EXPORT & IMPORT to tag additional information in the dump marking it as being for replication, and to keep track of state on the destination side, so that older events (if replayed) do not trash newer information. In conjunction, a change to DROPs was necessary as well, to make sure they follow a "DROP IF OLDER" kind of semantic.

Ultimately, there was a fair amount of integration and testing from Falcon's side to make sure that the whole system worked end-to-end. So far, our proof of concept is working well enough to contribute now.
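To illustrate the replay-safety idea (older events, if replayed, must not trash newer state), here is a minimal sketch in Java. It assumes each replicated object carries the id of the last event applied to it as a parameter; the class, method, and parameter names here are hypothetical and are not Hive's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplReplayGuard {
    // Hypothetical parameter key recording the last event id applied
    // to a replicated object on the destination side.
    static final String REPL_LAST_ID = "repl.last.id";

    // Apply an event only if it is strictly newer than the state already
    // recorded on the destination object; stale replays become no-ops.
    // This is also the shape of a "DROP IF OLDER" check: a drop carrying
    // an older event id than the object's recorded state is skipped.
    public static boolean shouldApply(Map<String, String> destParams, long eventId) {
        String last = destParams.get(REPL_LAST_ID);
        if (last != null && Long.parseLong(last) >= eventId) {
            return false; // older or duplicate event: do not overwrite newer state
        }
        destParams.put(REPL_LAST_ID, Long.toString(eventId));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> tableParams = new HashMap<>();
        System.out.println(shouldApply(tableParams, 5L)); // first event applies
        System.out.println(shouldApply(tableParams, 3L)); // stale replay skipped
        System.out.println(shouldApply(tableParams, 7L)); // newer event applies
    }
}
```

The point of keeping this check on the destination side is that replication sources may legitimately re-send events after a failure, so every apply path (including drops) has to be idempotent against out-of-order delivery.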
> Hive Replication
> ----------------
>
>                 Key: HIVE-1161
>                 URL: https://issues.apache.org/jira/browse/HIVE-1161
>             Project: Hive
>          Issue Type: New Feature
>          Components: Contrib
>            Reporter: Edward Capriolo
>            Assignee: SHAILESH PILARE
>            Priority: Minor
>
> Users may want to replicate data between two distinct hadoop clusters or two
> hive warehouses on the same cluster.
> Users may want to replicate entire catalogs or possibly on a table by table
> basis. Should this process be batch driven or be a full-time running
> application? What are some practical requirements, what are the limitations?
> Comments?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)