steveloughran commented on pull request #33828:
URL: https://github.com/apache/spark/pull/33828#issuecomment-1070799392
That dynamic partition work overlapped with the committer extension work
and the s3a committer.
It broke the merge; those lines you've found show the workaround.
A key problem with the Spark code is that it assumes file rename is a good
way to commit work. AFAIK, it doesn't assume that directory renames are atomic,
but unless file renames are fast, performance is going to be
unsatisfactory.
And on S3, file rename is O(data), so applications which use it to promote
work (hello hive!) really suffer.
That's why I have never looked for a good solution here.
Things are different on azure and google cloud, where file rename usually(*)
works. This means that we could look at what needs to be done.
Is there anything written up on this commit protocol I could look at to see
what could be done?
At the very least we could have the known committer implementations support
StreamCapabilities.hasCapability() with some rename-related FS capabilities
we could ask about indirectly (fast file rename, fast dir rename, atomic
dir rename), which would let Spark know what was actually viable at all. But
those are really FS capabilities; you can't really expect the committer itself
to know what the FS does, except in the case of the s3a committer, which is
hard coded to one FS, whose semantics are known (though Amplidata and NetApp S3
devices do have fast file copy/rename even there...)
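
To make the capability-probe idea concrete, here is a rough sketch, not a proposal for the actual API. It assumes a local stand-in interface with the same single-method shape as Hadoop's StreamCapabilities.hasCapability(String), so it compiles without Hadoop on the classpath. The capability names, the FixedCapabilityCommitter class and the strategy mapping in CommitPlanner are all hypothetical, made up for illustration; nothing like them exists in Hadoop or Spark today as far as I know.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Local stand-in with the same shape as Hadoop's StreamCapabilities:
// a single boolean probe keyed by a capability name.
interface CapabilityProbe {
    boolean hasCapability(String capability);
}

// Hypothetical capability names; these strings are invented for this sketch.
final class RenameCapabilities {
    static final String FAST_FILE_RENAME = "hypothetical.fs.rename.file.fast";
    static final String FAST_DIR_RENAME = "hypothetical.fs.rename.dir.fast";
    static final String ATOMIC_DIR_RENAME = "hypothetical.fs.rename.dir.atomic";

    private RenameCapabilities() {
    }
}

// Hypothetical committer that answers probes from a fixed set. A real
// committer would have to relay the question to its destination filesystem.
final class FixedCapabilityCommitter implements CapabilityProbe {
    private final Set<String> capabilities;

    FixedCapabilityCommitter(String... capabilities) {
        this.capabilities = new HashSet<>(Arrays.asList(capabilities));
    }

    @Override
    public boolean hasCapability(String capability) {
        return capabilities.contains(capability);
    }
}

// Caller side: choose how to promote work based on what the probe reports.
// The mapping is illustrative only, not Spark's commit protocol.
final class CommitPlanner {
    enum Strategy { RENAME_DIRECTORY, RENAME_FILES, AVOID_RENAME }

    static Strategy choose(CapabilityProbe committer) {
        boolean atomicDir = committer.hasCapability(RenameCapabilities.ATOMIC_DIR_RENAME);
        boolean fastDir = committer.hasCapability(RenameCapabilities.FAST_DIR_RENAME);
        if (atomicDir && fastDir) {
            return Strategy.RENAME_DIRECTORY;
        }
        if (committer.hasCapability(RenameCapabilities.FAST_FILE_RENAME)) {
            return Strategy.RENAME_FILES;
        }
        // Rename is O(data) or unknown: do not use it to promote work.
        return Strategy.AVOID_RENAME;
    }

    private CommitPlanner() {
    }
}

final class CommitPlannerDemo {
    public static void main(String[] args) {
        // A store where rename is a copy: no rename capabilities declared.
        CapabilityProbe slowRenameStore = new FixedCapabilityCommitter();
        // A store that only declares fast file rename.
        CapabilityProbe fastFileStore =
                new FixedCapabilityCommitter(RenameCapabilities.FAST_FILE_RENAME);

        System.out.println(CommitPlanner.choose(slowRenameStore));
        System.out.println(CommitPlanner.choose(fastFileStore));
    }
}
```

The caller only ever asks yes/no questions by name, so an unknown capability string simply returns false and the planner falls back to the conservative choice. That keeps old committers safe by default, but it doesn't solve the layering problem above: something still has to pass the question through to the filesystem.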