GitHub user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/18975
There is a difference in Hive's semantics vs what this PR is doing. In
Hive, the query execution writes to a staging location and the destination
location is cleared + re-populated after the end of query execution (it
happens in `MoveTask`). This PR will first wipe the destination location and
then perform the query execution to populate the destination location with
desired data.
I like the Hive model because:
- If the query execution fails, you at least have the old data. Insert
overwrite to a table does the same thing. This PR will leave the output
location empty.
- Hive achieves atomicity by using a staging dir. With this PR, I am not
sure what happens to the output location if some tasks have written out their
final data while the rest are still working. Would it have partial output?
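
To make the difference concrete, here is a minimal sketch of the two orderings on a local filesystem. It is not Hive's `MoveTask` or this PR's code; the function names and the `produce` callback (standing in for query execution) are hypothetical, chosen only for illustration.

```python
import os
import shutil
import tempfile


def overwrite_via_staging(dest, produce):
    """Staging model: run produce(staging_dir), touch dest only on success."""
    parent = os.path.dirname(os.path.abspath(dest))
    # Hypothetical staging dir created next to the destination.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        produce(staging)  # "query execution" writes here, not to dest
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)  # old data in dest is untouched
        raise
    # Only after success: clear the destination and move the new data in.
    if os.path.exists(dest):
        shutil.rmtree(dest)
    os.rename(staging, dest)


def overwrite_in_place(dest, produce):
    """Wipe-first model: clear dest, then let produce() write into it directly."""
    if os.path.exists(dest):
        shutil.rmtree(dest)
    os.makedirs(dest)
    produce(dest)  # a failure here leaves dest empty or partially written
```

With a `produce` that fails midway, the first function leaves the old contents of `dest` in place, while the second leaves whatever the failed run managed to write. Note that even in the staging version the final clear-and-rename is two steps, so this sketch narrows the window for partial output rather than removing it.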