[
https://issues.apache.org/jira/browse/HIVE-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745368#comment-17745368
]
Zhihua Deng commented on HIVE-27494:
------------------------------------
Thanks [~zabetak]!
I attached an example. In explain.out there are two maps: Map1 and Map3.
Map1 will put its intermediate result under the directory:
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000/HIVE_UNION_SUBDIR_1,
and Map3:
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000/HIVE_UNION_SUBDIR_2
The move task will move the
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000
to the table's directory.
Since this is a dynamic partition insert, the final temp directory would be:
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000/${dynamic_partition}/HIVE_UNION_SUBDIR_1
for Map1.
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000/${dynamic_partition}/HIVE_UNION_SUBDIR_2
for Map3.
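
To make the layout above easier to compare, here is a minimal Python sketch that only rebuilds the path strings quoted in this comment. The helper names (final_dir, temp_dir, shared_temp_root) are hypothetical and are not Hive APIs; ${warehouse_dir} and ${dynamic_partition} are kept as unresolved placeholders.

```python
from pathlib import PurePosixPath

# Placeholders and names copied from the comment; nothing is resolved here.
WAREHOUSE = "${warehouse_dir}"
TABLE = "nonacidpart"
STAGING = ".hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1"


def final_dir(branch: int) -> PurePosixPath:
    """Directory under -ext-10000 that the move task picks up (hypothetical helper)."""
    return PurePosixPath(WAREHOUSE, TABLE, STAGING, "-ext-10000",
                         f"HIVE_UNION_SUBDIR_{branch}")


def temp_dir(branch: int, partition: str = "${dynamic_partition}") -> PurePosixPath:
    """Temp directory used for a dynamic partition insert (hypothetical helper)."""
    return PurePosixPath(WAREHOUSE, TABLE, STAGING, "_tmp.-ext-10000",
                         partition, f"HIVE_UNION_SUBDIR_{branch}")


def shared_temp_root(branch: int) -> PurePosixPath:
    """The _tmp.-ext-10000 directory, two levels above the branch subdirectory."""
    return temp_dir(branch).parents[1]


if __name__ == "__main__":
    for branch in (1, 2):
        print(temp_dir(branch))
    # Both union branches sit under the same _tmp.-ext-10000 parent.
    print(shared_temp_root(1) == shared_temp_root(2))
```

The point of the sketch is only that HIVE_UNION_SUBDIR_1 and HIVE_UNION_SUBDIR_2 share one _tmp.-ext-10000 parent, so anything done to that parent affects both branches.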
When the TezTask finishes, it will close all operators one by one:
[https://github.com/apache/hive/blob/81759a105e50b9fd8c66ffbf2920f425a1d7a64c/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java#L679-L689]
Assume the FS operator in Map1 closes first. It will rename
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000
to
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000.moved:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1453-L1456]
So the FS operator in Map3 has no chance to take care of its own result, because
the parent dir has already been renamed to another directory.
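
The effect of that rename can be reproduced on a local filesystem. The following Python sketch is only a simulation under stated assumptions: it uses a throwaway temp directory instead of the warehouse, a made-up partition name "p=1", and a hypothetical function name; it models nothing of Hive beyond the single rename described above.

```python
import os
import tempfile


def simulate_close_order() -> dict:
    """Rename the shared _tmp.-ext-10000 parent, then look for Map3's output."""
    with tempfile.TemporaryDirectory() as staging:
        tmp = os.path.join(staging, "_tmp.-ext-10000")
        partition = "p=1"  # hypothetical dynamic partition value

        # Both union branches write under the same temp parent.
        for branch in (1, 2):
            os.makedirs(os.path.join(tmp, partition, f"HIVE_UNION_SUBDIR_{branch}"))

        # The FS operator in Map1 closes first and renames the shared parent.
        moved = tmp + ".moved"
        os.rename(tmp, moved)

        # The FS operator in Map3 closes next and looks at its original path.
        map3_original = os.path.join(tmp, partition, "HIVE_UNION_SUBDIR_2")
        map3_moved = os.path.join(moved, partition, "HIVE_UNION_SUBDIR_2")
        return {
            "map3_found_at_original_path": os.path.exists(map3_original),
            "map3_carried_into_moved_dir": os.path.exists(map3_moved),
        }


if __name__ == "__main__":
    print(simulate_close_order())
```

In this simulation Map3's subdirectory is no longer reachable at its original path once the parent is renamed; it ends up inside the ".moved" directory together with Map1's output.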
I think the temp directory with the "_tmp." prefix belongs only to the FS operator.
As long as we guarantee that the final output is safely moved to the directory
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000,
which is the one the move task knows about, it is safe to change the FS operator's
internal directory structure.
> Deduplicate the task result that generated by more branches in union all
> ------------------------------------------------------------------------
>
> Key: HIVE-27494
> URL: https://issues.apache.org/jira/browse/HIVE-27494
> Project: Hive
> Issue Type: Bug
> Reporter: Zhihua Deng
> Assignee: Zhihua Deng
> Priority: Major
> Labels: pull-request-available
>
> HIVE-23891 adds the ability to deduplicate the task results under the
> directory
> <table-dir>/<staging-dir>/_tmp.-ext-10000/<dynamic-partition-dir>/HIVE_UNION_SUBDIR_1,
> but it does not take the same action for this directory of the same
> query:
> <table-dir>/<staging-dir>/_tmp.-ext-10000/<dynamic-partition-dir>/HIVE_UNION_SUBDIR_2.
> So users may still hit the same data duplication problem when there are
> multiple Tez task attempts.