[ 
https://issues.apache.org/jira/browse/HIVE-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harish JP updated HIVE-24936:
-----------------------------
    Description: 
The taskId and taskAttemptId are not extracted correctly for copy files 
(e.g. 00001_02_copy_3), and when an incompatible copy file is moved the 
rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 
00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be 
00001_02_copy_N.
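
To illustrate the intended behaviour, here is a minimal Python sketch (not 
the Hive code, which is Java). The regex, parse_name and next_copy_name are 
hypothetical names made up for this example, and the sketch assumes file 
names have only the <taskId>_<attemptId>[_copy_N] shape shown above. It also 
assumes the new copy index starts just past the index of the file being moved.

```python
import re

# Hypothetical pattern for names like 00001_02 or 00001_02_copy_3.
# Real Hive file names may carry prefixes or extensions that this ignores.
NAME_RE = re.compile(r"^(?P<task>\d+)_(?P<attempt>\d+)(?:_copy_(?P<copy>\d+))?$")


def parse_name(name):
    """Return (task_id, attempt_id, copy_index); copy_index is 0 for non-copy files."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name}")
    return m.group("task"), m.group("attempt"), int(m.group("copy") or 0)


def next_copy_name(name, existing):
    """Return the first <task>_<attempt>_copy_N name that is not in `existing`."""
    task, attempt, copy = parse_name(name)
    n = copy + 1  # assumption: start searching just past the current copy index
    while f"{task}_{attempt}_copy_{n}" in existing:
        n += 1
    return f"{task}_{attempt}_copy_{n}"
```

With this, next_copy_name("00001_02_copy_3", {"00001_02_copy_3"}) yields 
00001_02_copy_4 instead of 00001_02_copy_3_1.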

 

Incompatible files should always be renamed using the current task; otherwise 
the file can get deleted if its name conflicts with another task's output 
file. Ex: if the input file name for a task is 00005_01 and the file is 
incompatible, then moving it as-is makes it look like an output file of task 
id 5, attempt 1. If that task attempt exists, it will try to generate the same 
file and fail, and another attempt will be made. There will then be 2 files, 
00005_01 and 00005_02, and the deduping code will remove 00005_01, resulting 
in data loss. There are other scenarios where the same can happen.
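
A minimal Python sketch of the rule above, under stated assumptions: 
move_target is a hypothetical helper, the task and attempt ids are passed in 
as already-formatted strings, and `existing` stands in for the names already 
present in the destination directory. The point is only that the target name 
is derived from the current task, never from the input file's own name.

```python
def move_target(current_task, current_attempt, existing):
    """Name an incompatible file after the task that moves it (hypothetical helper)."""
    base = f"{current_task}_{current_attempt}"
    if base not in existing:
        return base
    # The current task already produced a file with this name: use a copy suffix.
    n = 1
    while f"{base}_copy_{n}" in existing:
        n += 1
    return f"{base}_copy_{n}"
```

For example, if task 00002, attempt 00 moves the incompatible input file 
00005_01, the target becomes 00002_00 (or 00002_00_copy_1 if that name is 
taken), so it can no longer be mistaken for an output of task 5.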

  was:The taskId and taskAttemptId is not extracted correctly for copy files 
(00001_02_copy_3) and when doing a move file of an incompatible copy file the 
rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 
00001_02_copy_3_1 if 00001_02_copy_3 already exists, ideally it should be 
00001_02_copy_N.


> Fix file name parsing and copy file move.
> -----------------------------------------
>
>                 Key: HIVE-24936
>                 URL: https://issues.apache.org/jira/browse/HIVE-24936
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Harish JP
>            Assignee: Harish JP
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The taskId and taskAttemptId are not extracted correctly for copy files 
> (e.g. 00001_02_copy_3), and when an incompatible copy file is moved the 
> rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 
> 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be 
> 00001_02_copy_N.
>
> Incompatible files should always be renamed using the current task; 
> otherwise the file can get deleted if its name conflicts with another task's 
> output file. Ex: if the input file name for a task is 00005_01 and the file 
> is incompatible, then moving it as-is makes it look like an output file of 
> task id 5, attempt 1. If that task attempt exists, it will try to generate 
> the same file and fail, and another attempt will be made. There will then be 
> 2 files, 00005_01 and 00005_02, and the deduping code will remove 00005_01, 
> resulting in data loss. There are other scenarios where the same can happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
