[
https://issues.apache.org/jira/browse/HIVE-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744986#comment-17744986
]
Stamatis Zampetakis commented on HIVE-27494:
--------------------------------------------
The description refers directly to an internal directory structure, which is
difficult to follow for people (like myself) who are not familiar with the
area. [~dengzh] Can you add some more high-level information to the
description (such as DDL, query, plan, etc.) that leads to this kind of
problematic situation? I also had a look at the PR, but it contains low-level
code details and is hard to follow.
Before diving into a review, I would first like to understand what the problem
is.
> Deduplicate the task result that generated by more branches in union all
> ------------------------------------------------------------------------
>
> Key: HIVE-27494
> URL: https://issues.apache.org/jira/browse/HIVE-27494
> Project: Hive
> Issue Type: Bug
> Reporter: Zhihua Deng
> Assignee: Zhihua Deng
> Priority: Major
> Labels: pull-request-available
>
> HIVE-23891 adds the ability to deduplicate the task results under the
> directory
> <table-dir>/<staging-dir>/_tmp.-ext-10000/<dynamic-partition-dir>/HIVE_UNION_SUBDIR_1,
> but it does not take the same action for another directory of the same
> query:
> <table-dir>/<staging-dir>/_tmp.-ext-10000/<dynamic-partition-dir>/HIVE_UNION_SUBDIR_2.
> So users may still hit the same data duplication problem when there are
> multiple Tez task attempts.