[ https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124358#comment-14124358 ]
Na Yang commented on HIVE-7870:
-------------------------------

Removing those duplicate FileSinks is hard because, at the time the FileSinks are added to the FileSink set, there is no way to know which FileSink will eventually be used by the Spark work. If we remove the wrong FileSink from the set, we cannot create the proper linked FileSinks for the target FileSink, which causes wrong results for the merge and move work when hive.merge.sparkfiles is turned ON.

For example, in the following query, three duplicate FileSinks FS1, FS2, and FS3 are added to the FileSink set (numbered in the order in which they are added). FS2 and FS3 are used for the subqueries of the outer union, and FS2 and FS3 have different directories when hive.merge.sparkfiles=true.

{noformat}
insert overwrite table outputTbl1
SELECT * FROM
(
  select key, 1 as values from inputTbl1
  union all
  select * FROM (
    SELECT key, count(1) as values from inputTbl1 group by key
    UNION ALL
    SELECT key, 2 as values from inputTbl1
  ) a
) b;
{noformat}

However, in the following query, the same three duplicate FileSinks FS1, FS2, and FS3 are added to the FileSink set, but here FS1 is used for the subqueries of the union, and FS1, FS2, and FS3 all share the same directory when hive.merge.sparkfiles=true.

{noformat}
insert overwrite table outputTbl1
SELECT * FROM
(
  select key, 1 as values from inputTbl1
  union all
  select * FROM (
    SELECT key, 3 as values from inputTbl1
    UNION ALL
    SELECT key, 2 as values from inputTbl1
  ) a
) b;
{noformat}

When the FileSinks are added to the set, the final plan has not yet been generated, so there is no way to know which FileSink should not be added. After the final plan is generated, it is also hard to detect the duplicate FileSinks and remove the right one. Therefore, the duplicate FileSinks remain in the FileSink set. The potential problem they cause is generating multiple merge and move works when hive.merge.sparkfiles=true.
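As a rough illustration of the directory-based deduplication idea (hypothetical names, not Hive's actual planner classes), the key point is that merge/move work should be keyed by output directory, not by FileSink, so duplicates pointing at the same directory collapse to one work:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: produce at most one merge/move work per output
// directory, no matter how many duplicate FileSinks reference it.
public class MergeWorkDedup {
    static List<String> planMergeWorks(List<String> fileSinkDirs) {
        // Tracks directories that have already been processed.
        Map<String, Boolean> processed = new HashMap<>();
        List<String> mergeWorks = new ArrayList<>();
        for (String dir : fileSinkDirs) {
            // putIfAbsent returns null only on first insertion of a key,
            // so each directory yields exactly one merge/move work.
            if (processed.putIfAbsent(dir, Boolean.TRUE) == null) {
                mergeWorks.add("merge+move for " + dir);
            }
        }
        return mergeWorks;
    }

    public static void main(String[] args) {
        // Like the second example query: FS1, FS2, FS3 all share one
        // directory, so only one merge/move work is produced.
        List<String> works = planMergeWorks(
            Arrays.asList("/tmp/out1", "/tmp/out1", "/tmp/out1"));
        System.out.println(works.size()); // prints 1
    }
}
```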
This problem has been resolved in the patch by linking the duplicate FileSinks together and using a HashMap to ensure that each directory is processed only once, so only one merge and move work is generated per directory no matter how many duplicate FileSinks exist.

> Insert overwrite table query does not generate correct task plan [Spark Branch]
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7870
>                 URL: https://issues.apache.org/jira/browse/HIVE-7870
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Na Yang
>            Assignee: Na Yang
>              Labels: Spark-M1
>         Attachments: HIVE-7870.1-spark.patch, HIVE-7870.2-spark.patch, HIVE-7870.3-spark.patch, HIVE-7870.4-spark.patch, HIVE-7870.5-spark.patch
>
> Insert overwrite table query does not generate a correct task plan when the hive.optimize.union.remove and hive.merge.sparkfiles properties are ON.
> {noformat}
> set hive.optimize.union.remove=true
> set hive.merge.sparkfiles=true
> insert overwrite table outputTbl1
> SELECT * FROM
> (
>   select key, 1 as values from inputTbl1
>   union all
>   select * FROM (
>     SELECT key, count(1) as values from inputTbl1 group by key
>     UNION ALL
>     SELECT key, 2 as values from inputTbl1
>   ) a
> ) b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result:
> {noformat}
> 1 1
> 1 2
> 2 1
> 2 2
> 3 1
> 3 2
> 7 1
> 7 2
> 8 2
> 8 2
> 8 2
> {noformat}
> expected result:
> {noformat}
> 1 1
> 1 1
> 1 2
> 2 1
> 2 1
> 2 2
> 3 1
> 3 1
> 3 2
> 7 1
> 7 1
> 7 2
> 8 1
> 8 1
> 8 2
> 8 2
> 8 2
> {noformat}
> The move work is not working properly and some data are missing during the move.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)