[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

GitBox Thu, 11 Aug 2022 01:36:02 -0700


steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211699163


   Hey @HeartSaVioR. Yes, this is exactly what the API we worked on was 
designed for.
   
   There is no need to initiate an MPU (multipart upload) when writing small 
files; the OutputStream simply doesn't upload the data any more. You can check 
this by calling toString() on the stream; all its IO stats are there. This means 
that the cost is as normal: one PUT for data <= the block size; beyond that, one 
POST to initiate, one POST per block and one POST in close() to finalize. Block 
uploads are parallelised, though you do need enough HTTPS connections for this.
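   
   As a rough illustration of that cost model, here is a minimal sketch that 
counts the requests for a write of a given size. The function name, the result 
keys and the block size argument are all hypothetical, chosen for illustration; 
the breakdown simply follows the description above and is not taken from the 
S3A code.

```python
def write_request_counts(data_bytes: int, block_bytes: int) -> dict:
    """Count the requests for one write, per the description above.

    Hypothetical helper: block_bytes stands in for whatever block size
    the stream is configured with; it is an assumption, not a default.
    """
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    if data_bytes < 0:
        raise ValueError("data_bytes must not be negative")

    # Small file: data fits in one block, so a single PUT and no MPU.
    if data_bytes <= block_bytes:
        return {"put": 1, "initiate": 0, "blocks": 0, "complete": 0, "total": 1}

    # Larger file: one request to initiate the MPU, one per block,
    # and one in close() to finalize.
    blocks = -(-data_bytes // block_bytes)  # integer ceiling division
    return {
        "put": 0,
        "initiate": 1,
        "blocks": blocks,
        "complete": 1,
        "total": blocks + 2,
    }
```

   For example, with a hypothetical 100-byte block, a 250-byte write would 
need three block uploads plus the initiate and finalize requests, while any 
write of 100 bytes or less is a single PUT.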
   
   It's no more expensive than a normal write; upload performance will be the 
same, except when you call abort(), which is faster.
   
   That said, let me review the code to confirm this.
   

