[ 
https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=841149&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-841149
 ]

ASF GitHub Bot logged work on HIVE-23891:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Jan/23 14:44
            Start Date: 23/Jan/23 14:44
    Worklog Time Spent: 10m 
      Work Description: dengzhhu653 commented on code in PR #3836:
URL: https://github.com/apache/hive/pull/3836#discussion_r1084142900


##########
ql/src/test/org/apache/hadoop/hive/ql/exec/TestFileSinkOperator.java:
##########
@@ -183,37 +183,42 @@ public void testNonAcidRemoveDuplicate() throws Exception {
 
     JobConf jobConf2 = new JobConf(jobConf);
     jobConf2.set("mapred.task.id", "000000_1");
-    FileSinkOperator speculative = (FileSinkOperator)OperatorFactory.get(
+    FileSinkOperator op2 = (FileSinkOperator)OperatorFactory.get(
         new CompilationOpContext(), FileSinkDesc.class);
-    speculative.setConf(desc);
-    speculative.initialize(jobConf2, new ObjectInspector[]{inspector});
+    op2.setConf(desc);
+    op2.initialize(jobConf2, new ObjectInspector[]{inspector});
 
     for (Object r : rows) {
       op1.process(r, 0);
-      speculative.process(r, 0);
+      op2.process(r, 0);
     }
 
     op1.close(false);
-    // speculative task also ends successfully
-    speculative.close(false);
+    // Assume op2 also ends successfully, this happens in different containers
+    op2.close(false);
     Path[] paths = findFilesInBasePath();
     List<Path> mondays = Arrays.stream(paths)
         .filter(path -> path.getParent().toString().endsWith("partval=Monday/HIVE_UNION_SUBDIR_0"))
         .collect(Collectors.toList());
-    Assert.assertTrue(mondays.size() == 2);
+    Assert.assertEquals("Two result files were created", 2, mondays.size());
     Set<String> fileNames = new HashSet<>();
     fileNames.add(mondays.get(0).getName());
     fileNames.add(mondays.get(1).getName());
-    Assert.assertTrue(fileNames.contains("000000_1") && fileNames.contains("000000_0"));
 
+    Assert.assertTrue("000000_1 file is expected", fileNames.contains("000000_1"));
+    Assert.assertTrue("000000_0 file is expected", fileNames.contains("000000_0"));
+
+    // This happens in HiveServer2 when the job is finished, the job will call

Review Comment:
   Yes, `closeOp(boolean abort)` is called from the task, but `jobCloseOp` is
   called from the job client side (`jobClose` -> `jobCloseOp`):

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java#L732-L748

Issue Time Tracking
-------------------

    Worklog Id:     (was: 841149)
    Time Spent: 4.5h  (was: 4h 20m)

> Using UNION sql clause and speculative execution can cause file duplication 
> in Tez
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-23891
>                 URL: https://issues.apache.org/jira/browse/HIVE-23891
>             Project: Hive
>          Issue Type: Bug
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-23891.1.patch
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Hello,
> this can happen in the following scenario:
>  - the execution engine is Tez;
>  - speculative execution is on;
>  - the query inserts into a table and the last step is a UNION SQL clause.
> The problem is that Tez creates an extra layer of subdirectories when there 
> is a UNION. Later, when deduplicating, Hive doesn't take that into account 
> and only deduplicates folders but not the files inside.
> So for a query like this:
> {code:sql}
> insert overwrite table union_all
>     select * from union_first_part
> union all
>     select * from union_second_part;
> {code}
> The folder structure afterwards will be like this (a possible example):
> {code:java}
> .../union_all/HIVE_UNION_SUBDIR_1/000000_0
> .../union_all/HIVE_UNION_SUBDIR_1/000000_1
> .../union_all/HIVE_UNION_SUBDIR_2/000000_1
> {code}
> The attached patch increases the number of folder levels that Hive will check 
> recursively for duplicates when we have a UNION in Tez.
> Feel free to reach out if you have any questions :).
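
The idea described above (checking for duplicates inside the extra
subdirectory level, not only at the folder level) can be illustrated with a
minimal sketch. This is not the code from HIVE-23891.1.patch or from Hive
itself. The class name UnionDedupSketch, the method dedupe, the assumed
"<taskId>_<attemptId>" file-name shape, and the rule of keeping the lowest
attempt id are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: not the code from HIVE-23891.1.patch or from Hive.
final class UnionDedupSketch {

    // Assumed file-name shape "<taskId>_<attemptId>", e.g. "000000_1",
    // optionally preceded by any number of directory levels.
    private static final Pattern TASK_FILE =
            Pattern.compile("(?:(.*)/)?(\\d+)_(\\d{1,9})");

    /**
     * Keeps one file per (directory, task id) pair, however deeply nested.
     * Which attempt survives is an arbitrary choice here: the lowest attempt id.
     */
    static List<String> dedupe(List<String> paths) {
        Map<String, String> kept = new LinkedHashMap<>();
        Map<String, Integer> keptAttempt = new HashMap<>();
        for (String path : paths) {
            Matcher m = TASK_FILE.matcher(path);
            if (!m.matches()) {
                // Not a task output file under the assumed naming: pass through.
                kept.put("file:" + path, path);
                continue;
            }
            String dir = m.group(1) == null ? "" : m.group(1);
            // The key includes the full directory, so files in different
            // union subdirectories are never treated as duplicates of each other.
            String key = "task:" + dir + "/" + m.group(2);
            int attempt = Integer.parseInt(m.group(3));
            Integer previous = keptAttempt.get(key);
            if (previous == null || attempt < previous) {
                kept.put(key, path);
                keptAttempt.put(key, attempt);
            }
        }
        return new ArrayList<>(kept.values());
    }
}
```

Applied to the three example paths from the description (with the elided
prefix left out), this sketch keeps union_all/HIVE_UNION_SUBDIR_1/000000_0 and
union_all/HIVE_UNION_SUBDIR_2/000000_1 and drops the second attempt in
HIVE_UNION_SUBDIR_1.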



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
