Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    The last one was on all the doc comments, and I believe I've addressed them, 
both by fixing the little typos and by focusing the docs on the main points for 
Spark users: how object stores differ from filesystems, and what that means.
    
    The big issue for Spark users is "the commit problem", where I've listed 
the object store behaviours and said "this means things may not work; consult 
the docs". I'm not being explicit as to what works and what doesn't, as that's a 
moving target.
    
    Right now, I don't trust commits with S3A using the V1 or V2 
FileOutputCommitter algorithms to work 100% of the time, because they rely on 
list consistency, which Amazon S3 doesn't guarantee. I could make that a lot 
clearer, with something like:
    
    *You cannot reliably use the FileOutputCommitter to commit work to Amazon 
S3 or OpenStack Swift unless there is some form of consistency layer on top*.
    
    That is probably the core concept people need to know: it's not safe 
without something (EMR, S3Guard, Databricks Commit Service) to give you that 
consistent view.
    
    Then add pointers to the talks by myself and 
[Eric Liang](https://www.slideshare.net/databricks/robust-and-scalable-etl-over-cloud-storage-with-apache-spark) 
on the topic.
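
    As a rough illustration of why list consistency matters for a commit 
that works by listing a directory and renaming what it finds, here is a toy 
model in Python. It is only a sketch: `ToyObjectStore`, `commit_by_listing` 
and `run_job` are hypothetical names invented for this example, and the code 
does not reproduce the real FileOutputCommitter or the actual behaviour of any 
store. The only assumption it models is that a newly written object may not 
yet appear in a listing.

```python
class ToyObjectStore:
    """Hypothetical store: objects are readable by key at once, listings may lag."""

    def __init__(self, consistent_listing):
        self.objects = {}      # key -> data: the true contents of the store
        self.listed = set()    # keys that a list() call can currently see
        self.consistent_listing = consistent_listing

    def put(self, key, data):
        self.objects[key] = data
        if self.consistent_listing:
            self.listed.add(key)   # a consistency layer makes new keys visible at once

    def settle(self):
        """Simulate the listing index catching up with the true contents."""
        self.listed = set(self.objects)

    def list(self, prefix):
        return sorted(k for k in self.listed
                      if k in self.objects and k.startswith(prefix))

    def rename(self, src, dst):
        # Modelled as copy followed by delete; there is no atomic rename here.
        self.put(dst, self.objects[src])
        del self.objects[src]
        self.listed.discard(src)


def commit_by_listing(store, attempt_dir, dest_dir):
    """Commit by listing the attempt directory and renaming whatever is found."""
    for key in store.list(attempt_dir):
        store.rename(key, dest_dir + key[len(attempt_dir):])


def run_job(consistent_listing):
    store = ToyObjectStore(consistent_listing)
    attempt = "out/_temporary/attempt_0/"
    store.put(attempt + "part-0000", b"a")
    store.settle()                           # the first file is old enough to be listed
    store.put(attempt + "part-0001", b"b")   # written just before the commit
    commit_by_listing(store, attempt, "out/")
    store.settle()
    # Report the committed output, ignoring anything left under _temporary.
    return [k for k in store.list("out/") if "_temporary" not in k]


if __name__ == "__main__":
    print("consistent listing:  ", run_job(True))
    print("inconsistent listing:", run_job(False))
```

    In this model the job with a consistent listing commits both part files, 
while the job with a lagging listing commits only `part-0000` and reports no 
error: `part-0001` is simply left behind under `_temporary`. A silent, partial 
commit of that kind is the failure mode a consistency layer is meant to rule 
out.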
    


