[
https://issues.apache.org/jira/browse/HIVE-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zihao Ye updated HIVE-20912:
----------------------------
Priority: Critical (was: Major)
> Output data might be duplicated while speculation is enabled
> ------------------------------------------------------------
>
> Key: HIVE-20912
> URL: https://issues.apache.org/jira/browse/HIVE-20912
> Project: Hive
> Issue Type: Bug
> Components: Hive, Operators
> Affects Versions: 1.2.1
> Environment: Hive 1.2.1
> Hadoop 2.7.3
> Tez 0.7.0
> Reporter: Zihao Ye
> Priority: Critical
> Attachments: image-2018-11-14-17-48-59-826.png,
> image-2018-11-14-17-53-13-191.png, image-2018-11-14-17-53-50-171.png,
> image-2018-11-14-19-28-18-924.png
>
>
> The file merge stage had two tasks, which should have created two files, but
> three files were created.
> !image-2018-11-14-19-28-18-924.png!
> By tracing the log, we found that two task attempts (one of them speculative)
> happened to finish within the same second. Although the later one received a
> kill signal from the AM, its rename operation had already completed by then,
> which caused the data duplication.
> The rename is done in _AbstractFileMergeOperator.closeOp()_, and the
> final path name is determined by the task attempt id rather than the task
> id. In this case, the final paths ended with '000000_0' and '000000_1' rather
> than '000000'. IMHO, if the final path name ended with the task id, without
> the task attempt id, one task could generate at most one file, which could
> solve this issue. But I don't know the side effects of changing the final
> path name.
> This issue also affects other operators that rename files, such as
> JoinOperator and FileSinkOperator.
> !image-2018-11-14-17-53-13-191.png!
> !image-2018-11-14-17-53-50-171.png!
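To illustrate the naming change proposed in the description, here is a minimal sketch, not Hive code. The class TaskFileNames, the method taskIdOf, and the assumed name layout (a numeric task id with an optional "_<attempt>" suffix, as in '000000_0') are hypothetical and taken only from the examples in the report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: derives a final file name from the
// task id alone, so every attempt of one task targets the same name.
class TaskFileNames {
    // Assumed layout: <taskId> or <taskId>_<attemptId>, both numeric.
    private static final Pattern NAME = Pattern.compile("^(\\d+)(?:_(\\d+))?$");

    /** Returns the task id portion of a name such as "000000_1". */
    static String taskIdOf(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // Both attempts of task 000000 resolve to the same final name.
        System.out.println(taskIdOf("000000_0"));
        System.out.println(taskIdOf("000000_1"));
    }
}
```

With names derived this way, the two attempts would contend for a single destination instead of producing two files. Whether the second rename fails or replaces the first depends on the filesystem and is not covered by this sketch.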