[ https://issues.apache.org/jira/browse/HIVE-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harish JP updated HIVE-24936:
-----------------------------
Description:
The taskId and taskAttemptId are not extracted correctly for copy files
(e.g. 00001_02_copy_3), and when an incompatible copy file is moved the
rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to
00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be
00001_02_copy_N.
Incompatible files should always be renamed using the current task; otherwise
a moved file can get deleted if its name conflicts with another task's output
file. Ex: if the input file name for a task is 00005_01 and the file is
incompatible, moving it as-is makes it look like an output file of task id 5,
attempt 1. If that task attempt exists, it will try to generate the same file
and fail, and another attempt will be made. There will then be 2 files,
00005_01 and 00005_02, and the deduping code will remove 00005_01, resulting
in data loss. There are other scenarios where the same can happen.
was: The taskId and taskAttemptId is not extracted correctly for copy files
(00001_02_copy_3) and when doing a move file of an incompatible copy file the
rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to
00001_02_copy_3_1 if 00001_02_copy_3 already exists, ideally it should be
00001_02_copy_N.
> Fix file name parsing and copy file move.
> -----------------------------------------
>
> Key: HIVE-24936
> URL: https://issues.apache.org/jira/browse/HIVE-24936
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Reporter: Harish JP
> Assignee: Harish JP
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The taskId and taskAttemptId are not extracted correctly for copy files
> (e.g. 00001_02_copy_3), and when an incompatible copy file is moved the
> rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to
> 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be
> 00001_02_copy_N.
>
> Incompatible files should always be renamed using the current task; otherwise
> a moved file can get deleted if its name conflicts with another task's output
> file. Ex: if the input file name for a task is 00005_01 and the file is
> incompatible, moving it as-is makes it look like an output file of task id 5,
> attempt 1. If that task attempt exists, it will try to generate the same file
> and fail, and another attempt will be made. There will then be 2 files,
> 00005_01 and 00005_02, and the deduping code will remove 00005_01, resulting
> in data loss. There are other scenarios where the same can happen.
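
The naming rules described above can be sketched in Java. This is a minimal
illustration, not the code from the HIVE-24936 patch. The class TaskFileNames,
its methods, and the regular expression are all hypothetical. It assumes file
names have exactly the form taskId_attemptId[_copy_N] with no prefix or
extension, and that N should be the next unused copy index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive.
final class TaskFileNames {
    // Assumed layout: <taskId>_<attemptId>[_copy_<N>], e.g. 00001_02_copy_3.
    private static final Pattern NAME =
        Pattern.compile("^(\\d+)_(\\d+)(?:_copy_(\\d+))?$");

    /** Parsed parts of a file name; copyIndex is 0 when there is no _copy_N suffix. */
    static final class Parsed {
        final String taskId;
        final String attemptId;
        final int copyIndex;

        Parsed(String taskId, String attemptId, int copyIndex) {
            this.taskId = taskId;
            this.attemptId = attemptId;
            this.copyIndex = copyIndex;
        }
    }

    /** Splits a name into task id, attempt id and copy index. */
    static Parsed parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized file name: " + fileName);
        }
        int copy = m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
        return new Parsed(m.group(1), m.group(2), copy);
    }

    /**
     * Returns fileName if it is unused, otherwise the first unused
     * taskId_attemptId_copy_N with N greater than the current copy index.
     * The suffix is never stacked (no 00001_02_copy_3_1).
     */
    static String nextFreeName(String fileName, Set<String> existing) {
        Parsed p = parse(fileName);
        String base = p.taskId + "_" + p.attemptId;
        String candidate = fileName;
        int n = p.copyIndex;
        while (existing.contains(candidate)) {
            n++;
            candidate = base + "_copy_" + n;
        }
        return candidate;
    }

    /**
     * Names a moved incompatible file after the current task rather than
     * after the task encoded in the input file name.
     */
    static String nameForCurrentTask(String taskId, String attemptId, Set<String> existing) {
        return nextFreeName(taskId + "_" + attemptId, existing);
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("00001_02_copy_3", "00005_01"));
        // Copy file whose name is already taken: bump the copy index.
        System.out.println(nextFreeName("00001_02_copy_3", existing));
        // Incompatible input 00005_01 moved by task 00002, attempt 00.
        System.out.println(nameForCurrentTask("00002", "00", existing));
    }
}
```

With the set above, the first call yields 00001_02_copy_4 instead of
00001_02_copy_3_1, and the second yields 00002_00, so the moved file no longer
looks like an output of task id 5, attempt 1.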