[ 
https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=461944&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461944
 ]

ASF GitHub Bot logged work on HIVE-23891:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Jul/20 09:59
            Start Date: 22/Jul/20 09:59
    Worklog Time Spent: 10m 
      Work Description: georgepachitariu commented on a change in pull request #1294:
URL: https://github.com/apache/hive/pull/1294#discussion_r458678610



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -1459,8 +1459,12 @@ public static void mvFileToFinalPath(Path specPath, Configuration hconf,
       }
 
       // Remove duplicates from tmpPath
-      List<FileStatus> statusList = HiveStatsUtils.getFileStatusRecurse(
-          tmpPath, ((dpCtx == null) ? 1 : dpCtx.getNumDPCols()), fs);
+      int level = dpCtx == null ? 1 : dpCtx.getNumDPCols();
+      // - when execution engine is Tez and the query uses sql clause "Union"
+      //   there is an extra layer of subdirectories that contains the branches of the union;
+      boolean isUnionClauseInTez = conf.getDirName().toString().contains(AbstractFileMergeOperator.UNION_SUDBIR_PREFIX);

Review comment:
       Thank you :)






Issue Time Tracking
-------------------

    Worklog Id:     (was: 461944)
    Time Spent: 50m  (was: 40m)

> Using UNION sql clause and speculative execution can cause file duplication in Tez
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-23891
>                 URL: https://issues.apache.org/jira/browse/HIVE-23891
>             Project: Hive
>          Issue Type: Bug
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-23891.1.patch
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Hello,
> this can happen in the following scenario:
>  - the execution engine is Tez;
>  - speculative execution is on;
>  - the query inserts into a table and the last step is a UNION SQL clause.
> The problem is that Tez creates an extra layer of subdirectories when there
> is a UNION. Later, when deduplicating, Hive doesn't take that extra layer
> into account: it deduplicates only the folders and never compares the files
> inside them.
> So for a query like this:
> {code:sql}
> insert overwrite table union_all
>     select * from union_first_part
> union all
>     select * from union_second_part;
> {code}
> The resulting folder structure can look like this (one possible example):
> {code:java}
> .../union_all/HIVE_UNION_SUBDIR_1/000000_0
> .../union_all/HIVE_UNION_SUBDIR_1/000000_1
> .../union_all/HIVE_UNION_SUBDIR_2/000000_1
> {code}
> The attached patch increases the number of folder levels that Hive will check 
> recursively for duplicates when we have a UNION in Tez.
> Feel free to reach out if you have any questions :).
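
To illustrate the description above, here is a minimal, self-contained sketch of file-level deduplication inside the union subdirectories. It is not the code from the patch: the class name, the method name, and the "keep the first attempt seen" rule are assumptions made for illustration only, and the file names are assumed to follow a <task id>_<attempt> pattern as in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hive code: finds redundant task attempts when
// the comparison descends into each union subdirectory.
public class UnionDedupSketch {

    // Returns the paths that are extra attempts of a task already seen in
    // the same directory. Assumption: the first attempt seen is kept.
    static List<String> findDuplicateAttempts(List<String> paths) {
        Map<String, String> kept = new LinkedHashMap<>();
        List<String> duplicates = new ArrayList<>();
        for (String path : paths) {
            int slash = path.lastIndexOf('/');
            String dir = slash < 0 ? "" : path.substring(0, slash);
            String name = path.substring(slash + 1);
            // Assumed naming: "<task id>_<attempt>", e.g. "000000_1".
            int underscore = name.lastIndexOf('_');
            String taskId = underscore < 0 ? name : name.substring(0, underscore);
            // The key includes the directory, so files in different union
            // subdirectories are never treated as duplicates of each other.
            String key = dir + "/" + taskId;
            if (kept.putIfAbsent(key, path) != null) {
                duplicates.add(path);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "union_all/HIVE_UNION_SUBDIR_1/000000_0",
            "union_all/HIVE_UNION_SUBDIR_1/000000_1",
            "union_all/HIVE_UNION_SUBDIR_2/000000_1");
        // Prints: [union_all/HIVE_UNION_SUBDIR_1/000000_1]
        System.out.println(findDuplicateAttempts(files));
    }
}
```

If the comparison stopped one level higher, at the HIVE_UNION_SUBDIR_* folders, the two attempts in the first subdirectory would never be compared, which matches the duplication described in the report.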



