[ https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124358#comment-14124358 ]

Na Yang commented on HIVE-7870:
-------------------------------

Removing those duplicated FileSinks is hard because, at the time the FileSinks are 
added to the FileSink set, we cannot tell which FileSink will eventually be used by 
the SparkWork. If we remove the wrong FileSink from the set, we are not able to 
create the proper linked FileSinks for the target FileSink. This causes wrong 
results for the merge and move work when hive.merge.sparkfiles is turned ON.

For example, in the following query, three duplicate FileSinks FS1, FS2, and FS3 
will be added to the FileSink set (the numbers reflect the order in which they are 
added). FS2 and FS3 will be used for the subqueries of the outer union. In 
addition, FS2 and FS3 have different directories when 
hive.merge.sparkfiles=true.

insert overwrite table outputTbl1
SELECT * FROM
(
select key, 1 as values from inputTbl1
union all
select * FROM (
  SELECT key, count(1) as values from inputTbl1 group by key
  UNION ALL
  SELECT key, 2 as values from inputTbl1
) a
)b;

However, in the following query, just as in the query above, three duplicate 
FileSinks FS1, FS2, and FS3 will be added to the FileSink set, but only FS1 will 
be used for the subqueries of the union. FS1, FS2, and FS3 all have the same 
directory when hive.merge.sparkfiles=true.

insert overwrite table outputTbl1
SELECT * FROM
(
select key, 1 as values from inputTbl1
union all
select * FROM (
  SELECT key, 3 as values from inputTbl1
  UNION ALL
  SELECT key, 2 as values from inputTbl1
) a
)b;

When the FileSinks are added to the FileSink set, the final plan has not been 
generated yet, so there is no way to know which FileSink should not be added to 
the set. Even after the final plan is generated, it is hard to detect the 
duplicate FileSinks and remove the right ones.

Therefore, the duplicate FileSinks stay in the FileSink set. The potential problem 
they cause is generating multiple merge and move works when 
hive.merge.sparkfiles=true. The patch resolves this by linking the duplicate 
FileSinks together and using a HashMap to make sure each directory is processed 
only once, so only one merge and move work is generated per directory no matter 
how many duplicate FileSinks exist.
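
Just to illustrate the idea (this is a minimal sketch, not the actual patch code; 
FileSink, MergeMoveWork, and processFileSink below are simplified stand-ins for 
Hive's real classes): keep a map keyed by the FileSink's output directory and 
only create the merge/move work the first time a directory is seen.

{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: deduplicate merge/move work generation by output directory.
public class MergeMoveDedup {

  static class FileSink {
    final String dirName;
    FileSink(String dirName) { this.dirName = dirName; }
  }

  static class MergeMoveWork {
    final String dir;
    MergeMoveWork(String dir) { this.dir = dir; }
  }

  // One merge/move work per directory, no matter how many duplicate FileSinks exist.
  private final Map<String, MergeMoveWork> processedDirs = new HashMap<>();

  public MergeMoveWork processFileSink(FileSink fs) {
    // computeIfAbsent creates the work for a directory only once;
    // later duplicate FileSinks pointing at the same directory reuse it.
    return processedDirs.computeIfAbsent(fs.dirName, MergeMoveWork::new);
  }

  public static void main(String[] args) {
    MergeMoveDedup dedup = new MergeMoveDedup();
    // Like FS1, FS2, FS3 in the second example, all sharing one directory:
    FileSink fs1 = new FileSink("/tmp/outputTbl1");
    FileSink fs2 = new FileSink("/tmp/outputTbl1");
    FileSink fs3 = new FileSink("/tmp/outputTbl1");
    System.out.println(dedup.processFileSink(fs1) == dedup.processFileSink(fs2)); // true
    System.out.println(dedup.processFileSink(fs2) == dedup.processFileSink(fs3)); // true
  }
}
{noformat}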



> Insert overwrite table query does not generate correct task plan [Spark 
> Branch]
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7870
>                 URL: https://issues.apache.org/jira/browse/HIVE-7870
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Na Yang
>            Assignee: Na Yang
>              Labels: Spark-M1
>         Attachments: HIVE-7870.1-spark.patch, HIVE-7870.2-spark.patch, 
> HIVE-7870.3-spark.patch, HIVE-7870.4-spark.patch, HIVE-7870.5-spark.patch
>
>
> Insert overwrite table query does not generate correct task plan when 
> hive.optimize.union.remove and hive.merge.sparkfiles properties are ON. 
> {noformat}
> set hive.optimize.union.remove=true
> set hive.merge.sparkfiles=true
> insert overwrite table outputTbl1
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
>   SELECT key, count(1) as values from inputTbl1 group by key
>   UNION ALL
>   SELECT key, 2 as values from inputTbl1
> ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result
> {noformat}
> 1     1
> 1     2
> 2     1
> 2     2
> 3     1
> 3     2
> 7     1
> 7     2
> 8     2
> 8     2
> 8     2
> {noformat}
> expected result:
> {noformat}
> 1     1
> 1     1
> 1     2
> 2     1
> 2     1
> 2     2
> 3     1
> 3     1
> 3     2
> 7     1
> 7     1
> 7     2
> 8     1
> 8     1
> 8     2
> 8     2
> 8     2
> {noformat}
> Move work is not working properly and some data are missing during move.



