felixcheung commented on issue #25979: [SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data
URL: https://github.com/apache/spark/pull/25979#issuecomment-537773541

Yeah, this is a hard one: the behavior is clearly buggy and hard to detect, but that's how Hive is designed. I think Spark should at least log a warning so interested folks (like us) can detect this after the job runs.

> On Hive 2.1.0, two "INSERT OVERWRITE" runs produce data files with the same name, like 000000_0. The second "INSERT OVERWRITE" moves its file in and overwrites the old file.
> On Hive 2.3.2, the second "INSERT OVERWRITE" causes a failure when moving a file with the same name.

We can't really rely on the names being the same to overwrite; that depends on a number of things. For instance, if the original partition has 10B rows across 1M files and is overwritten with a new partition of 1B rows across 100k files, then most of the old files (roughly 900k) are never overwritten and remain in the partition.
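The stale-file scenario above can be sketched in a few lines. This is a hypothetical simulation, not Spark or Hive code: it models "overwrite by moving files onto matching names" as a dict keyed by file name, and shows that a smaller second run leaves the extra files from the first run behind.

```python
# Hypothetical sketch of why move-onto-same-name is not a real overwrite:
# only files whose names collide are replaced; the rest of the old
# partition contents survive and are still read as live data.

def insert_overwrite(partition: dict, n_files: int, run_id: str) -> None:
    """Simulate task outputs named like Hive's 000000_0, 000001_0, ...
    Moving a file onto an existing name replaces it; files from earlier
    runs with non-colliding names are left untouched (no delete step)."""
    for i in range(n_files):
        partition[f"{i:06d}_0"] = run_id

partition = {}
insert_overwrite(partition, n_files=5, run_id="first")   # first INSERT OVERWRITE
insert_overwrite(partition, n_files=2, run_id="second")  # smaller second run

# Files written by the first run that the second run never touched:
stale = [name for name, run in partition.items() if run == "first"]
```

With 5 files in the first run and 2 in the second, 3 stale files remain, which mirrors the 1M-file vs 100k-file example above.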
