[
https://issues.apache.org/jira/browse/HIVE-21197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
mahesh kumar behera updated HIVE-21197:
---------------------------------------
Status: Patch Available (was: Open)
> Hive replication can add duplicate data during migration to a target with
> hive.strict.managed.tables enabled
> ------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-21197
> URL: https://issues.apache.org/jira/browse/HIVE-21197
> Project: Hive
> Issue Type: Task
> Components: repl
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-21197.01.patch, HIVE-21197.02.patch,
> HIVE-21197.03.patch
>
> Time Spent: 11h 10m
> Remaining Estimate: 0h
>
> During the bootstrap phase, the files copied to the target may include files
> created by events that are not part of the bootstrap. This happens because
> bootstrap first gets the last event id and then the file list. If events are
> added between these two steps, bootstrap also includes the files created by
> those events. The same files are then copied again during the first
> incremental replication that follows the bootstrap. In the normal scenario,
> the duplicate copy does not cause any issue, as Hive allows use of the target
> database only after the first incremental. In the migration case, however,
> the files at source and target are copied to different locations (based on
> the write id at the target), so this may lead to duplicate data at the
> target. This can be avoided by checking for duplicate files at load time.
> The check can be restricted to the first incremental, and the search can be
> limited to the bootstrap directory (with write id 1). If the file is already
> present, the copy is simply skipped.
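The load-time check described above could look roughly like the sketch below. It is an illustration only, not the code from the attached patches: the class DuplicateCopyFilter, the method filesToCopy, and the idea of passing the bootstrap directory listing in as a set of file names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hive. It only illustrates the idea of
// skipping files that the bootstrap load has already copied.
class DuplicateCopyFilter {

    /**
     * Returns the files that still need to be copied.
     *
     * @param bootstrapFileNames names already present in the bootstrap
     *                           directory (the one with write id 1);
     *                           assumed to be listed by the caller
     * @param eventFileNames     names of files referenced by the event
     * @param firstIncremental   true only for the first incremental load
     *                           after bootstrap
     */
    static List<String> filesToCopy(Set<String> bootstrapFileNames,
                                    List<String> eventFileNames,
                                    boolean firstIncremental) {
        // Later incremental loads cannot overlap with bootstrap, so the
        // check is skipped and every file is copied.
        if (!firstIncremental) {
            return new ArrayList<>(eventFileNames);
        }
        List<String> toCopy = new ArrayList<>();
        for (String name : eventFileNames) {
            // A file already present in the bootstrap directory was copied
            // during bootstrap; copying it again would duplicate the data.
            if (!bootstrapFileNames.contains(name)) {
                toCopy.add(name);
            }
        }
        return toCopy;
    }

    public static void main(String[] args) {
        Set<String> bootstrap = new HashSet<>(Arrays.asList("000000_0", "000001_0"));
        List<String> incoming = Arrays.asList("000001_0", "000002_0");
        // prints [000002_0]
        System.out.println(filesToCopy(bootstrap, incoming, true));
    }
}
```

Matching by file name alone is a simplification chosen for the sketch; a real check may need to compare more than the name.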