felixcheung commented on issue #25979: [SPARK-29295][SQL] Insert overwrite to 
Hive external table partition should delete old data
URL: https://github.com/apache/spark/pull/25979#issuecomment-537773541
 
 
   yeah, this is a hard one: the behavior is obviously buggy, hard to detect, 
etc., but that's how Hive is designed. I think we should at least log a warning 
in Spark so interested folks (like us) can detect this after the job is run
   
   > On Hive 2.1.0, two "INSERT OVERWRITE" runs produce data files with the 
same name, like 000000_0. The second "INSERT OVERWRITE" moves the new file in 
and overwrites the old file.
   
   > On Hive 2.3.2, the second "INSERT OVERWRITE" causes the following failure 
when moving a file with the same name
   
   we can't really rely on the names being the same to overwrite the old data; 
that depends on a number of things. For instance, if the original partition has 
10B rows across 1M files and is overwritten by a new partition with 1B rows 
across 100k files, then a lot of the old files (around 900k) are never going to 
be overwritten
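   The arithmetic above can be sketched quickly. This is a hypothetical
back-of-the-envelope illustration, not Spark or Hive code; the file-name
pattern follows the 000000_0 style quoted earlier, and the 1M/100k counts are
the example numbers from the comment:

   ```python
   # Hypothetical sketch: why relying on file-name collisions to "overwrite"
   # a partition leaves stale data when the new write produces fewer files.

   # Old partition: 1M files named in the Hive task-output style NNNNNN_0.
   old_files = {f"{i:06d}_0" for i in range(1_000_000)}

   # New write: only 100k files, named the same way.
   new_files = {f"{i:06d}_0" for i in range(100_000)}

   # Only colliding names get replaced; everything else survives as stale data.
   stale = old_files - new_files
   print(len(stale))  # 900000
   ```

   So unless the new write happens to produce at least as many files with the
exact same names, most of the old partition's data simply stays behind.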
