Hi Folks,

We had some work done on replication back in HIVE-7973, which implemented a primary mode of replication for Hive that can integrate with tools like Falcon. I intend to continue improving on this and to fix some of the major problems with the current implementation, mostly the following:

a) Replication follows a rubberbanding pattern, wherein different tables/partitions can be in a different/mixed state on the destination, so that unless all events are caught up on, we do not have an equivalent warehouse. Thus, this only satisfies DR cases, not load-balancing use cases, and the secondary warehouse is really only seen as a backup, rather than as a live warehouse that trails the primary.

b) The base implementation is naive and has several performance problems, including a large amount of duplication of data for subsequent events, as mentioned in HIVE-13348, and having to copy out entire partitions/tables when just a delta of files might be sufficient, etc. Also, using EXPORT/IMPORT allows us a simple implementation, but at the cost of tons of temporary space, much of which is not actually applied at the destination.

To that end, I want to create a new branch, so that we can track development of this on the public Apache JIRA. The last time I worked on this, having a private branch meant large uber patches as in HIVE-10227, which I would like to avoid this time; a public branch is also more in keeping with open development. Also, developing in master itself is not a good idea, since some of the ideas I'm trying out are experimental and probably still a ways from maturity.

So, unless anyone has any objection, I would like to create a new branch off master, say "repl2", and create an uber jira to manage individual components of the work.

Thanks,
-Sushanth