Zihao Ye created HIVE-20912:

             Summary: Output data might be duplicated while speculation is 
                 Key: HIVE-20912
                 URL: https://issues.apache.org/jira/browse/HIVE-20912
             Project: Hive
          Issue Type: Bug
          Components: Hive, Operators
    Affects Versions: 1.2.1
         Environment: Hive 1.2.1, Hadoop 2.7.3, Tez 0.7.0
            Reporter: Zihao Ye
         Attachments: image-2018-11-14-17-48-59-826.png, 
image-2018-11-14-17-53-13-191.png, image-2018-11-14-17-53-50-171.png

The file merge stage had two tasks, which should have created two files, but 
three files were created.


By tracing the log, we found that two task attempts (one of them speculative) 
happened to finish within the same second. Although the later one received a 
kill signal from the AM, its rename operation had already completed by then, 
which caused the data duplication.

The rename is done in _AbstractFileMergeOperator.closeOp()_, and the final path 
name is determined by the task attempt id rather than the task id. In this 
case, the final paths ended with '000000_0' and '000000_1' rather than 
'000000'. IMHO, if the final path name ended with the task id alone, without 
the task attempt id, each task could generate at most one file, which could 
solve this issue. However, I don't know what side effects changing the final 
path name would have.
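
To illustrate the idea, here is a minimal, self-contained sketch; it is not Hive code. The class TaskFileNames, its methods, and the in-memory set standing in for the output directory are all hypothetical, and the sketch assumes file names of the form '<task id>_<attempt number>' as seen above. It shows that once the attempt suffix is dropped, both attempts map to the same final name, so only the first commit can succeed.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskFileNames {
    // Assumed naming scheme: task id digits, '_', attempt number (e.g. 000000_1).
    private static final Pattern ATTEMPT_NAME = Pattern.compile("^(\\d+)_(\\d+)$");

    // Hypothetical stand-in for the final output directory.
    private final Set<String> committed = new HashSet<>();

    // Drop the attempt suffix; names that do not match are returned unchanged.
    static String taskIdOnly(String attemptFileName) {
        Matcher m = ATTEMPT_NAME.matcher(attemptFileName);
        return m.matches() ? m.group(1) : attemptFileName;
    }

    // Returns true only for the first attempt that commits a given task id.
    boolean commit(String attemptFileName) {
        return committed.add(taskIdOnly(attemptFileName));
    }

    int committedCount() {
        return committed.size();
    }

    public static void main(String[] args) {
        TaskFileNames out = new TaskFileNames();
        System.out.println(out.commit("000000_0")); // original attempt
        System.out.println(out.commit("000000_1")); // speculative attempt
        System.out.println(out.committedCount());
    }
}
```

In a real file system the "first commit wins" behaviour would depend on the rename refusing to overwrite an existing destination, which is an assumption that would need to be checked for the file system in use.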

This issue also affects other operators related to file renaming like 
JoinOperator and FileSinkOperator.



