[GitHub] [spark] steveloughran commented on pull request #33828: [SPARK-36579][CORE][SQL] Make spark source stagingDir can be customized

GitBox Thu, 17 Mar 2022 04:09:38 -0700


steveloughran commented on pull request #33828:
URL: https://github.com/apache/spark/pull/33828#issuecomment-1070799392



   That dynamic partition work overlapped with the committer extension work 
and the s3a committer.
   
   It broke the merge; those lines you've found show the workaround.
   
   A key problem with the Spark code is that it assumes file rename is a good 
way to commit work. AFAIK, it doesn't assume that directory renames are atomic, 
but unless file renames are fast, performance is going to be 
unsatisfactory.
   
   And on S3, file rename is O(data), so applications which use it to promote 
work (hello Hive!) really suffer.
   That's why I have never looked for a good solution here.
   Things are different on Azure and Google Cloud, where file rename usually(*) 
works. This means that we could look at what needs to be done.
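
   To illustrate the O(data) point, here is a toy in-memory model, not real 
S3 or Hadoop code. It assumes the usual object-store emulation of rename as 
copy followed by delete; the class names `ToyObjectStore` and `ToyFileSystem` 
are made up for this sketch.

```python
class ToyObjectStore:
    """Toy store with no native rename: rename is emulated as copy + delete."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = bytes(data)   # full copy of the payload
        self.bytes_copied += len(data)    # cost grows with the data size
        del self.objects[src]


class ToyFileSystem:
    """Toy filesystem where rename only re-links the name to the same data."""

    def __init__(self):
        self.files = {}
        self.bytes_copied = 0

    def put(self, path, data):
        self.files[path] = data

    def rename(self, src, dst):
        # Metadata-only operation: no payload bytes are touched.
        self.files[dst] = self.files.pop(src)


if __name__ == "__main__":
    payload = b"x" * 1_000_000
    for store in (ToyObjectStore(), ToyFileSystem()):
        store.put("staging/part-0000", payload)
        store.rename("staging/part-0000", "final/part-0000")
        print(type(store).__name__, "bytes copied:", store.bytes_copied)
```

   A commit that promotes every task output by rename pays that copy cost 
once per file in the first model and nothing in the second.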
   
   Is there anything written up on this commit protocol I could look at to see 
what could be done? 
   
   At the very least we could have the known committer implementations support 
StreamCapabilities.hasCapability(), with some capabilities we could use to ask 
indirectly about the FS's rename behaviour (fast file rename, fast dir rename, 
atomic dir rename). That would let Spark know what was actually viable. But 
those are really FS capabilities: you can't expect the committer itself to 
know what the FS does, except in the case of the s3a committer, which is 
hard coded to one FS whose semantics are known (though amplidata and netapp s3 
devices do have fast file copy/rename even there...)
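
   A minimal sketch of what such a probe could look like, written as a 
standalone Python model rather than against the real Hadoop or Spark APIs. 
The capability strings, the classes and `rename_commit_is_viable` are all 
hypothetical names invented for illustration; none of them exist in Hadoop 
today.

```python
# Hypothetical capability names; not defined anywhere in Hadoop.
FAST_FILE_RENAME = "example.rename.file.fast"
FAST_DIR_RENAME = "example.rename.dir.fast"
ATOMIC_DIR_RENAME = "example.rename.dir.atomic"


class ToyStore:
    """Stand-in for a filesystem that can declare its rename semantics."""

    def __init__(self, capabilities):
        self._capabilities = frozenset(capabilities)

    def has_capability(self, capability):
        return capability in self._capabilities


class DelegatingCommitter:
    """Committer that does not know the answer and forwards the probe to the FS."""

    def __init__(self, fs):
        self._fs = fs

    def has_capability(self, capability):
        return self._fs.has_capability(capability)


class FixedStoreCommitter:
    """Committer hard coded to one store whose semantics are known up front."""

    def has_capability(self, capability):
        # Assumption for this sketch: the bound store has no fast rename at all.
        return False


class LegacyCommitter:
    """Committer that predates the probe and offers no has_capability()."""


def rename_commit_is_viable(committer):
    """Return True only if the committer positively reports fast file rename."""
    probe = getattr(committer, "has_capability", None)
    if probe is None:
        return False  # unknown committer: assume rename is not cheap
    return bool(probe(FAST_FILE_RENAME))
```

   The caller only ever talks to the committer; whether the answer comes from 
the filesystem or from hard-coded knowledge is the committer's business. 
Treating "no answer" as "not viable" is a conservative choice made for this 
sketch.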
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
