Yao Guangdong created HIVE-25836:
------------------------------------

             Summary: Tez union operation may cause duplicate data
                 Key: HIVE-25836
                 URL: https://issues.apache.org/jira/browse/HIVE-25836
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.3.8, 2.3.0
            Reporter: Yao Guangdong


When a query uses the Tez union all operation, the result may contain duplicate 
data in some cases. This happens because the Tez union all operation can create 
subdirectories inside the table or partition directory. The subdirectories are 
named with numbers, and the result data files are stored inside them. If 
speculative execution launches a duplicate task, that task's result file is 
also written to the subdirectory. When the job finishes, the Hive client 
deletes the files left by duplicate tasks, but it only checks one directory 
level for them. Because the subdirectories exist, the duplicate task files 
inside them are not deleted, and duplicate data appears in the result.
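
The difference between a one-level check and a recursive check can be sketched 
with a small standalone Python script. This is not Hive's code: the function 
names are hypothetical, the file names are assumed to follow a 
'<taskId>_<attemptId>' pattern for illustration only, and keeping the lowest 
attempt is an arbitrary choice made for the sketch.

```python
import os
import tempfile
from collections import defaultdict


def duplicate_attempts(filenames):
    """Return names to drop so that one attempt per task id remains.

    Assumes names look like '<taskId>_<attemptId>' (illustrative only).
    """
    by_task = defaultdict(list)
    for name in filenames:
        task_id, _, attempt = name.partition("_")
        by_task[task_id].append((int(attempt), name))
    drop = []
    for attempts in by_task.values():
        attempts.sort()
        # Keep the lowest attempt, drop the rest (arbitrary choice for the sketch).
        drop.extend(name for _, name in attempts[1:])
    return sorted(drop)


def one_level_check(root):
    """Look for duplicates only among files directly under root."""
    files = [e.name for e in os.scandir(root) if e.is_file()]
    return [os.path.join(root, n) for n in duplicate_attempts(files)]


def recursive_check(root):
    """Look for duplicates in root and in every subdirectory."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        found.extend(os.path.join(dirpath, n) for n in duplicate_attempts(files))
    return sorted(found)


def build_example(root):
    """Create numbered subdirectories; '1' holds a speculative duplicate."""
    layout = {"1": ["000000_0", "000000_1"], "2": ["000000_0"]}
    for sub, names in layout.items():
        os.makedirs(os.path.join(root, sub))
        for n in names:
            with open(os.path.join(root, sub, n), "w") as f:
                f.write("row\n")


def demo():
    with tempfile.TemporaryDirectory() as root:
        build_example(root)
        shallow = [os.path.relpath(p, root) for p in one_level_check(root)]
        deep = [os.path.relpath(p, root) for p in recursive_check(root)]
    return shallow, deep


if __name__ == "__main__":
    shallow, deep = demo()
    print("one-level check finds:", shallow)
    print("recursive check finds:", deep)
```

In this sketch the one-level check sees only the subdirectories "1" and "2" 
and no files, so it reports nothing, while the recursive check reports the 
second attempt's file inside subdirectory "1".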



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
