danielcweeks commented on pull request #1573: URL: https://github.com/apache/iceberg/pull/1573#issuecomment-721415513
> Dan and I had a discussion about `create` and came to an agreement to implement it, but to make sure the docs for it don't make claims about atomicity. It is still useful to have it because some engines may use `create` without overwrite to write data files. And although there are no guarantees when used concurrently, at least only one writer wins. We've had issues with concurrent writes corrupting version-hint.txt files in other file systems, so a lack of guarantees isn't unique to S3.

I had to make some significant changes in order to make `create` work in what seems like a reasonable way, but there are a lot of behavioral details to consider. For example, `S3InputFile` caches the result of the `exists()` call (as does `S3OutputFile` now). I believe we do that to optimize for the traditional `exists()` -> `contentLength()` -> `open()` pattern, so that we don't make multiple requests to S3 for the same information. However, this may not make sense in the create path.

This also exposes some strange aspects of the `FileIO` API, in that `deleteFile` is a function of `FileIO` while `exists()` is a function of `InputFile` (though not `OutputFile`).
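To make the caching concern concrete, here is a minimal sketch of how a cached `exists()` result could let a `create` without overwrite go through after another writer has already created the file. Everything in it is hypothetical and for illustration only: `CachedExistsSketch`, `SketchOutputFile`, `createWithFreshCheck`, the in-memory `STORE` map, and the example location are stand-ins, not the PR's `S3OutputFile` or any S3 client API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CachedExistsSketch {
    // Hypothetical in-memory stand-in for an object store (location -> contents).
    static final Map<String, String> STORE = new ConcurrentHashMap<>();

    // Hypothetical example location, not taken from the PR.
    static final String LOCATION = "s3://bucket/data/file.parquet";

    // Hypothetical output file that caches the result of exists().
    static class SketchOutputFile {
        private final String location;
        private Boolean cachedExists; // null until the first exists() call

        SketchOutputFile(String location) {
            this.location = location;
        }

        boolean exists() {
            if (cachedExists == null) {
                cachedExists = STORE.containsKey(location);
            }
            return cachedExists;
        }

        // create-without-overwrite that trusts the cached exists() answer.
        void create(String contents) {
            if (exists()) {
                throw new IllegalStateException("Already exists: " + location);
            }
            STORE.put(location, contents);
            cachedExists = true;
        }

        // Variant that re-checks the store on every call. The check and the
        // put are still two separate steps, so this is not atomic either.
        void createWithFreshCheck(String contents) {
            if (STORE.containsKey(location)) {
                throw new IllegalStateException("Already exists: " + location);
            }
            STORE.put(location, contents);
            cachedExists = true;
        }
    }

    // Writer A caches "does not exist", writer B creates the file, then A's
    // create() passes its stale check and replaces B's contents.
    static String staleCacheDemo() {
        STORE.clear();
        SketchOutputFile a = new SketchOutputFile(LOCATION);
        SketchOutputFile b = new SketchOutputFile(LOCATION);
        a.exists();
        b.create("written by B");
        a.create("written by A");
        return STORE.get(LOCATION);
    }

    // Same interleaving, but A re-checks the store and is rejected.
    static boolean freshCheckRejects() {
        STORE.clear();
        SketchOutputFile a = new SketchOutputFile(LOCATION);
        SketchOutputFile b = new SketchOutputFile(LOCATION);
        a.exists();
        b.create("written by B");
        try {
            a.createWithFreshCheck("written by A");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale cache result: " + staleCacheDemo());
        System.out.println("fresh check rejects: " + freshCheckRejects());
    }
}
```

The sketch only illustrates the sequential interleaving. With truly concurrent writers, even the fresh check leaves a window between the existence check and the write, which is consistent with documenting `create` without any atomicity claims.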
