Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/17834
The last one was on all the doc comments, and I believe I've addressed them,
both by fixing the little typos and by focusing the docs on the main points for
Spark users: how stores differ from filesystems, and what that means.
The big issue for Spark users is "the commit problem", where I've listed
the object store behaviours and said "this means things may not work; consult
the docs". I'm not being explicit about what works and what doesn't, as that's a
moving target.
Right now, I don't trust commits to S3a using the V1 or V2
FileOutputCommitter algorithms to work 100% of the time, because they rely on
list consistency, which Amazon S3 doesn't guarantee. I could make that a lot
clearer, something like
*You cannot reliably use the FileOutputCommitter to commit work to Amazon
S3 or OpenStack Swift unless there is some form of consistency layer on top*.
That probably is the core concept people need to know: it's not safe
without something (EMR, S3Guard, Databricks Commit Service) to give you that
consistent view.
Then add pointers to the talks by myself and [Eric
Liang](https://www.slideshare.net/databricks/robust-and-scalable-etl-over-cloud-storage-with-apache-spark)
on the topic.
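To make the list-consistency point concrete, here is a toy sketch of a commit that publishes whatever a directory listing returns. It is not the FileOutputCommitter code or any real S3 client: `ToyObjectStore`, `settle`, and `commit_by_listing` are hypothetical names invented for this illustration, and the "eventual consistency" is simulated by a listing index that lags behind writes.

```python
# Toy model only: not the Hadoop FileOutputCommitter or a real object store API.
class ToyObjectStore:
    """Objects are readable by key at once, but may be missing from listings."""

    def __init__(self):
        self.objects = {}    # key -> data, always up to date
        self.listed = set()  # keys currently visible to list()

    def put(self, key, data):
        # The object is stored, but the listing index is not updated yet.
        self.objects[key] = data

    def settle(self):
        # Simulate the listing index catching up with all writes so far.
        self.listed = set(self.objects)

    def list(self, prefix):
        return sorted(k for k in self.listed if k.startswith(prefix))


def commit_by_listing(store, temp_prefix, dest_prefix):
    """Publish every object that a listing of the temporary prefix returns."""
    committed = []
    for key in store.list(temp_prefix):
        dest = dest_prefix + key[len(temp_prefix):]
        store.put(dest, store.objects[key])  # copy stands in for rename
        committed.append(dest)
    return committed


store = ToyObjectStore()
store.put("out/_temporary/part-0000", b"a")
store.settle()                               # part-0000 is now listable
store.put("out/_temporary/part-0001", b"b")  # written, but not yet listed

committed = commit_by_listing(store, "out/_temporary/", "out/")
print(committed)  # ['out/part-0000']
```

In this sketch the commit reports success even though `part-0001` never reaches the destination, which is the failure mode a consistency layer is meant to rule out.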