Some replication work was done back in HIVE-7973, which implemented a
primary mode of replication for Hive that can integrate with tools
like Falcon. I intend to continue improving this and to fix some of
the major problems with the current implementation, mainly the
following:
a) Replication follows a rubberbanding pattern, wherein different
tables/partitions can be in different, mixed states on the
destination, so that unless all events have been caught up on, we do
not have an equivalent warehouse. Thus, this only satisfies disaster
recovery (DR) cases, not load-balancing use cases, and the secondary
warehouse is really only seen as a backup, rather than as a live
warehouse that trails the primary.
b) The base implementation is naive and has several performance
problems, including a large amount of data duplication for subsequent
events, as mentioned in HIVE-13348, and having to copy out entire
partitions/tables when just a delta of files might be sufficient.
Also, using EXPORT/IMPORT allows us a simple implementation, but at
the cost of a lot of temporary space, much of which is not actually
applied at the destination.
To that end, I want to create a new branch, so that we can track
development of this on the public Apache JIRA. The last time I worked
on this, having a private branch meant large uber patches, as in
HIVE-10227, which I would like to avoid this time; working in a
public branch is also more in keeping with open development.
Developing in master itself is not a good idea either, since some of
the ideas I'm trying out are experimental and probably still a ways
from maturity.
So, unless anyone has any objections, I would like to create a new
branch off master, say "repl2", and create an uber JIRA to manage
individual components of the work.