[ 
https://issues.apache.org/jira/browse/HIVE-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745368#comment-17745368
 ] 

Zhihua Deng commented on HIVE-27494:
------------------------------------

Thanks [~zabetak]!

I attached an example. In explain.out there are two maps, Map1 and Map3. 
Map1 will put its intermediate result under the directory:

${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000/HIVE_UNION_SUBDIR_1
 and Map3 will put its result under:

${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000/HIVE_UNION_SUBDIR_2

The move task will move the 
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000
 to the table's directory.

Because this is a dynamic partition insert, the final temp directory would be:

${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000/${dynamic_partition}/HIVE_UNION_SUBDIR_1
 for Map1.

${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000/${dynamic_partition}/HIVE_UNION_SUBDIR_2
 for Map3.
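
To make the layout concrete, here is a minimal sketch of how the two per-branch temp directories relate. The class and method names, the staging path, and the partition value "p=1" are made up for illustration and are not Hive code. The point is only that both branches end up under the same "_tmp.-ext-10000" parent.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class UnionStagingLayout {
    // Hypothetical helper (not a Hive API): builds the temp directory of one
    // union branch under the shared "_tmp.-ext-10000" parent.
    static Path branchTmpDir(Path stagingDir, String dynamicPartition, int unionBranch) {
        return stagingDir
                .resolve("_tmp.-ext-10000")
                .resolve(dynamicPartition)
                .resolve("HIVE_UNION_SUBDIR_" + unionBranch);
    }

    public static void main(String[] args) {
        // Assumed example values; the real staging directory name is generated by Hive.
        Path staging = Paths.get("/warehouse/nonacidpart/.hive-staging_example");
        Path map1 = branchTmpDir(staging, "p=1", 1);
        Path map3 = branchTmpDir(staging, "p=1", 2);
        System.out.println(map1);
        System.out.println(map3);
        // Both branches share the same partition directory and "_tmp." parent.
        System.out.println(map1.getParent().equals(map3.getParent()));
    }
}
```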

When the TezTask finishes, it will close all operators one by one:

[https://github.com/apache/hive/blob/81759a105e50b9fd8c66ffbf2920f425a1d7a64c/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java#L679-L689]

Assume the FS operator in Map1 closes first. It will rename the 
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000
 to 
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/_tmp.-ext-10000.moved:

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1453-L1456]

So the FS operator in Map3 has no chance to take care of its own result, because 
the parent directory has already been renamed.
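
The effect can be reproduced with a small sketch on a local filesystem. This is only an illustration under assumptions: it uses java.nio instead of the Hadoop FileSystem API, does not call Hive's Utilities, and the class name, the partition value "p=1" and the file name "000000_0" are made up. It creates the two union subdirectories, renames the shared "_tmp.-ext-10000" parent as the first FS operator would, and then checks whether the second branch's directory is still reachable at its original path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameRaceSketch {
    // Returns true if the second branch's directory is still visible at its
    // original path after the shared parent has been renamed.
    static boolean secondBranchStillVisible() {
        try {
            Path staging = Files.createTempDirectory("hive-staging-sketch");
            Path tmp = staging.resolve("_tmp.-ext-10000");
            Path branch1 = tmp.resolve("p=1").resolve("HIVE_UNION_SUBDIR_1");
            Path branch2 = tmp.resolve("p=1").resolve("HIVE_UNION_SUBDIR_2");
            Files.createDirectories(branch1);
            Files.createDirectories(branch2);
            Files.createFile(branch1.resolve("000000_0"));
            Files.createFile(branch2.resolve("000000_0"));

            // The first FS operator closes and renames the shared parent,
            // which carries both union subdirectories along with it.
            Files.move(tmp, staging.resolve("_tmp.-ext-10000.moved"));

            // The second FS operator now looks for its output at the old path.
            return Files.exists(branch2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondBranchStillVisible()); // prints "false"
    }
}
```

The second branch's files are not lost; they now live under "_tmp.-ext-10000.moved", but nothing looks for them at the original path any more.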

I think the temp directory with the prefix "_tmp." belongs only to the FS operator. 
As long as we guarantee that the final output will be safely moved to the directory 
${warehouse_dir}/nonacidpart/.hive-staging_hive_2023-07-21_11-39-00_562_4675614078807904377-1/-ext-10000
 that the move task knows about, it's safe to change the FS operator's internal 
directory structure.

 

> Deduplicate the task result that generated by more branches in union all
> ------------------------------------------------------------------------
>
>                 Key: HIVE-27494
>                 URL: https://issues.apache.org/jira/browse/HIVE-27494
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zhihua Deng
>            Assignee: Zhihua Deng
>            Priority: Major
>              Labels: pull-request-available
>
> HIVE-23891 adds the ability to deduplicate the task results under the 
> directory
> <table-dir>/<staging-dir>/_tmp.-ext-10000/<dynamic-partition-dir>/HIVE_UNION_SUBDIR_1,
> but it does not take the same action on this directory for the same 
> query:
> <table-dir>/<staging-dir>/_tmp.-ext-10000/<dynamic-partition-dir>/HIVE_UNION_SUBDIR_2.
> So users may still hit the same data duplication problem when there are 
> multiple Tez task attempts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
