Sushanth Sowmyan updated HIVE-13348:
Labels: (was: gsoc2016)
> Add Event Nullification support for Replication
> Key: HIVE-13348
> URL: https://issues.apache.org/jira/browse/HIVE-13348
> Project: Hive
> Issue Type: Sub-task
> Components: Import/Export
> Reporter: Sushanth Sowmyan
> Replication, as implemented by HIVE-7973, works as follows:
> a) For every single modification to the Hive metastore, an event gets
> triggered that logs a notification object.
> b) Replication tools such as Falcon can consume these notification objects as
> an HCatReplicationTaskIterator from
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event, we generate statements and distcp requirements for Falcon
> to export, distcp and import to do the replication (along with requisite
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty
> dumb about how it works: it exhaustively processes every single event
> generated, and tries to do the export-distcp-import cycle for all
> modifications, irrespective of whether or not the result will actually get
> used at import time.
> We need to build some sort of filtering logic which can process a batch of
> events to identify events that will result in effective no-ops, and to
> nullify those events from the stream before passing them on. The goal is to
> minimize the number of events that tools like Falcon would actually have
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is created in event#34 and eventually
> gets dropped in event#47, then there is no point in replicating it. We
> simply null out both these events, and also any other event between
> event#34 and event#47 that references this object.
> b) APPEND-APPEND: Some objects are replicated wholesale, which means every
> APPEND that occurs would cause a full export of the object in question. In
> that case, the prior APPENDs would all be supplanted by the last APPEND.
> Thus, we could nullify all the prior such events.
> Additional such cases can be inferred by analysis of the Export-Import relay
> protocol definition at
> or by reasoning out various event processing orders possible.
> Replication, as implemented by HIVE-7973, is merely a first step for
> functional support. This work is needed for replication to be efficient at
> all, and thus, usable.
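
The two nullification cases described above (CREATE-DROP and APPEND-APPEND) could be sketched roughly as follows. This is only an illustration and not Hive or HCatalog code: the `Event` record, the event kind strings, and the `nullify` function are hypothetical names, and the sketch assumes every object in the batch is replicated wholesale, so that a later APPEND fully supplants an earlier one.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a replication notification event."""
    event_id: int
    kind: str   # assumed labels: "CREATE", "DROP", "APPEND", "ALTER"
    obj: str    # name of the object the event touches


def nullify(events: List[Event]) -> List[Event]:
    """Return the batch without effective no-op events, preserving order."""
    dead: Set[int] = set()  # ids of events to drop from the stream

    # Case (a), CREATE-DROP: if an object is created and later dropped
    # inside the same batch, drop the CREATE, the DROP, and every event
    # in between that references the same object.
    open_create: Dict[str, int] = {}  # object name -> index of its CREATE
    for i, ev in enumerate(events):
        if ev.kind == "CREATE":
            open_create[ev.obj] = i
        elif ev.kind == "DROP" and ev.obj in open_create:
            start = open_create.pop(ev.obj)
            for other in events[start:i + 1]:
                if other.obj == ev.obj:
                    dead.add(other.event_id)

    # Case (b), APPEND-APPEND: among the surviving APPENDs on an object,
    # keep only the last one (assumes wholesale replication of the object).
    last_append: Dict[str, int] = {}
    for ev in events:
        if ev.kind == "APPEND" and ev.event_id not in dead:
            last_append[ev.obj] = ev.event_id
    for ev in events:
        if ev.kind == "APPEND" and last_append.get(ev.obj) != ev.event_id:
            dead.add(ev.event_id)

    return [ev for ev in events if ev.event_id not in dead]


batch = [
    Event(34, "CREATE", "t1"),
    Event(35, "APPEND", "t2"),
    Event(40, "ALTER", "t1"),
    Event(41, "APPEND", "t2"),
    Event(47, "DROP", "t1"),
]
surviving_ids = [ev.event_id for ev in nullify(batch)]
```

In this batch only event 41 would be passed on: events 34, 40 and 47 cancel out as a CREATE-DROP span on `t1`, and event 35 is supplanted by the later APPEND on `t2`. A DROP whose matching CREATE lies outside the batch is kept, since the object may already exist on the destination.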
This message was sent by Atlassian JIRA