[ https://issues.apache.org/jira/browse/HIVE-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated HIVE-23891:
--------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
> UNION ALL and multiple task attempts can cause file duplication
> ---------------------------------------------------------------
>
> Key: HIVE-23891
> URL: https://issues.apache.org/jira/browse/HIVE-23891
> Project: Hive
> Issue Type: Bug
> Reporter: George Pachitariu
> Assignee: Zhihua Deng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
> Attachments: HIVE-23891.1.patch
>
> Time Spent: 5h 10m
> Remaining Estimate: 0h
>
> Hello,
> this can happen in the following scenario:
> - the execution engine is Tez;
> - speculative execution is on;
> - the query inserts into a table and its last step is a UNION SQL clause.
> The problem is that Tez creates an extra layer of subdirectories when there
> is a UNION. Later, when deduplicating, Hive doesn't take that extra layer
> into account: it deduplicates only the folders, not the files inside them.
> So for a query like this:
> {code:sql}
> insert overwrite table union_all
> select * from union_first_part
> union all
> select * from union_second_part;
> {code}
> Afterwards, the folder structure can look like this (one possible example):
> {code:java}
> .../union_all/HIVE_UNION_SUBDIR_1/000000_0
> .../union_all/HIVE_UNION_SUBDIR_1/000000_1
> .../union_all/HIVE_UNION_SUBDIR_2/000000_1
> {code}
> The attached patch increases the number of folder levels that Hive will check
> recursively for duplicates when we have a UNION in Tez.
> Feel free to reach out if you have any questions :).
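
The depth problem described in the quoted report can be illustrated with a
small, self-contained sketch. It is written in Python for brevity and is not
Hive's actual Java code. The function name redundant_attempts, the max_depth
parameter, and the rule of keeping the highest attempt number are all
assumptions made for this example. File names are assumed to follow the
<task>_<attempt> pattern seen in the listing, and paths are given relative to
the table directory.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: output files are named <task>_<attempt>, e.g. 000000_1.
ATTEMPT_FILE = re.compile(r"^(\d+)_(\d+)$")


def redundant_attempts(paths, max_depth):
    """Return files that duplicate another attempt of the same task.

    paths     -- file paths relative to the table directory
    max_depth -- how many directory levels below the table directory are
                 inspected (hypothetical parameter for this sketch)
    """
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        depth = len(path.parts) - 1  # number of directories above the file
        match = ATTEMPT_FILE.match(path.name)
        if depth > max_depth or match is None:
            continue  # too deep to be seen, or not a task output file
        task_id, attempt = match.group(1), int(match.group(2))
        # Attempts only compete with each other inside the same directory.
        groups[(path.parent, task_id)].append((attempt, raw))

    redundant = []
    for attempts in groups.values():
        attempts.sort()
        # Assumption: keep the highest attempt, flag the rest as duplicates.
        redundant.extend(raw for _, raw in attempts[:-1])
    return sorted(redundant)


listing = [
    "HIVE_UNION_SUBDIR_1/000000_0",
    "HIVE_UNION_SUBDIR_1/000000_1",
    "HIVE_UNION_SUBDIR_2/000000_1",
]

# Without descending into the UNION subdirectories nothing is detected.
print(redundant_attempts(listing, max_depth=0))  # []
# One extra level exposes the duplicate attempt.
print(redundant_attempts(listing, max_depth=1))  # ['HIVE_UNION_SUBDIR_1/000000_0']
```

Grouping by (parent directory, task) means the file 000000_1 in
HIVE_UNION_SUBDIR_2 is not treated as a duplicate of the one in
HIVE_UNION_SUBDIR_1, since the two come from different branches of the UNION.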
--
This message was sent by Atlassian Jira
(v8.20.10#820010)