[ https://issues.apache.org/jira/browse/HIVE-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113142#comment-14113142 ]
Chao commented on HIVE-7870:
----------------------------

There are two issues here:

1. When hive.merge.sparkfiles is turned on, {{GenSparkUtils::removeUnionOperators}} adds duplicate file sinks, and duplicate move/merge tasks are then generated for them. (Thanks [~nyang] for the discussion.)
2. When hive.optimize.union.remove is turned on, some dependencies in the {{SparkWork}} are lost, and hence the result table contains only partial data. MR's {{GenMRFileSink1}} checks for linked file sink descriptors and links them to the same task. The corresponding procedure is lacking in the Tez/Spark implementation.

> Insert overwrite table query does not generate correct task plan
> ----------------------------------------------------------------
>
>                 Key: HIVE-7870
>                 URL: https://issues.apache.org/jira/browse/HIVE-7870
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Na Yang
>            Assignee: Chao
>              Labels: Spark-M1
>
> Insert overwrite table query does not generate a correct task plan when the
> hive.optimize.union.remove and hive.merge.sparkfiles properties are ON.
> {noformat}
> set hive.optimize.union.remove=true;
> set hive.merge.sparkfiles=true;
> insert overwrite table outputTbl1
> SELECT * FROM
> (
>   select key, 1 as values from inputTbl1
>   union all
>   select * FROM (
>     SELECT key, count(1) as values from inputTbl1 group by key
>     UNION ALL
>     SELECT key, 2 as values from inputTbl1
>   ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> query result
> {noformat}
> 1	1
> 1	2
> 2	1
> 2	2
> 3	1
> 3	2
> 7	1
> 7	2
> 8	2
> 8	2
> 8	2
> {noformat}
> expected result:
> {noformat}
> 1	1
> 1	1
> 1	2
> 2	1
> 2	1
> 2	2
> 3	1
> 3	1
> 3	2
> 7	1
> 7	1
> 7	2
> 8	1
> 8	1
> 8	2
> 8	2
> 8	2
> {noformat}
> Move work is not working properly and some data is missing after the move.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
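
To make the "partial data" symptom concrete, here is a rough Python sketch that evaluates the three branches of the query by hand. The contents of inputTbl1 are not given in the issue, so the keys below are an assumption, chosen only because they reproduce both listings. Under that assumption the posted query result is consistent with only the inner UNION ALL reaching outputTbl1 while the rows of the outer {{select key, 1}} branch are dropped. That is an inference from the posted output, not something the issue states.

```python
from collections import Counter

# ASSUMPTION: inputTbl1's contents are not shown in the issue. These keys are
# inferred so that the full union matches the "expected result" listing.
input_tbl1 = [1, 2, 3, 7, 8, 8]

# Branch 1: select key, 1 as values from inputTbl1
branch1 = [(k, 1) for k in input_tbl1]
# Branch 2: select key, count(1) as values from inputTbl1 group by key
branch2 = sorted(Counter(input_tbl1).items())
# Branch 3: select key, 2 as values from inputTbl1
branch3 = [(k, 2) for k in input_tbl1]

# All three branches, ordered by key, values: matches "expected result".
expected = sorted(branch1 + branch2 + branch3)
# Only the inner UNION ALL (branches 2 and 3): matches "query result".
reported = sorted(branch2 + branch3)

# Multiset difference: the rows absent from the reported output.
missing = Counter(expected) - Counter(reported)

for key, value in sorted(missing.elements()):
    print(key, value)
```

With the assumed input, every missing row comes from branch 1, which fits the comment's point that a dependency in the {{SparkWork}} is lost when union removal is on.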