Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/5042#issuecomment-91905439
@liancheng, if this is intended only for writing directly to S3, then I
think we purposely want to avoid writing to `_temporary`: moving / renaming
the temporary file to the final file is expensive, and S3 PUT operations are
atomic. If two tasks race to write to the same file, then the last writer
should win and we shouldn't have to worry about partially-written files. As
with DirectOutputCommitter, this is _not_ safe for use on filesystems where
file creation / overwrite isn't atomic, such as HDFS, but that's not a use
case that we have to worry about here (I think this is only for use by
advanced users who know that they're writing directly to S3).
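
To make the trade-off above concrete, here is a minimal sketch using a toy
in-memory store. It is not Spark's or Hadoop's committer API: the names
`AtomicObjectStore`, `direct_write` and `temp_then_rename_write` are
hypothetical. It assumes whole-object PUTs are atomic and models rename as
copy + delete, which is how a rename behaves on S3.

```python
import threading


class AtomicObjectStore:
    """Toy in-memory stand-in for an object store with atomic whole-object PUT."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()
        self.bytes_copied = 0  # bytes moved by rename (modelled as copy + delete)

    def put(self, key, data):
        # The whole object becomes visible at once: a reader sees either the
        # previous value or the new one, never a partial write.
        with self._lock:
            self._objects[key] = bytes(data)

    def get(self, key):
        with self._lock:
            return self._objects.get(key)

    def rename(self, src, dst):
        # Assumption: rename is copy + delete, so its cost grows with size.
        with self._lock:
            data = self._objects.pop(src)
            self._objects[dst] = data
            self.bytes_copied += len(data)


def direct_write(store, final_key, data):
    # Direct strategy: one PUT straight to the final key.
    store.put(final_key, data)


def temp_then_rename_write(store, final_key, data, attempt_id):
    # Conventional strategy: write under _temporary, then rename into place.
    tmp_key = f"_temporary/{attempt_id}/{final_key}"
    store.put(tmp_key, data)
    store.rename(tmp_key, final_key)


# Two tasks race to write the same output file directly.
store = AtomicObjectStore()
payloads = [b"A" * 1024, b"B" * 1024]
threads = [
    threading.Thread(target=direct_write, args=(store, "part-00000", p))
    for p in payloads
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whichever task finished last wins; the result is one complete payload.
result = store.get("part-00000")
```

In this sketch the direct path never copies data, while the
temp-then-rename path pays for a copy of every byte it commits. On a
filesystem without atomic create / overwrite, the direct path would lose the
"never partial" guarantee that the lock stands in for here.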