attilapiros opened a new pull request, #37474:
URL: https://github.com/apache/spark/pull/37474

   
   ### What changes were proposed in this pull request?
   
   Currently on S3 the checkpoint file manager 
(`FileContextBasedCheckpointFileManager`) relies on the rename operation: 
when a file is opened as an atomic stream, a temporary file is used behind 
the scenes, and when the stream is committed that file is renamed to its final 
location.
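   
   For readers unfamiliar with the pattern, here is a minimal local-filesystem 
sketch of a rename-based atomic write. It is not Spark's implementation; 
`write_atomically` is a hypothetical helper and the paths are illustrative only.

```python
import os
import tempfile


def write_atomically(final_path, data):
    """Write data to a hidden temp file, then rename it to final_path."""
    directory = os.path.dirname(final_path) or "."
    # Keep the temp file in the target directory so the rename stays
    # within a single filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as stream:
            stream.write(data)
        # Commit step: readers see either no file or the complete file.
        os.replace(temp_path, final_path)
    except BaseException:
        # Cancel step: drop the temp file so nothing partial is left behind.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

   On a local filesystem the final `os.replace` is a cheap metadata update, 
which is why this pattern works well there.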
   
   But on S3 a rename is carried out as a file copy, so it has serious 
performance implications.
   
   Hadoop 3 introduced a new interface called 
[Abortable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Abortable.html)
 and `S3AFileSystem` has this capability, which is implemented on top of S3's 
multipart upload. So when the file is committed [a POST is 
sent](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html)
 and when it is aborted [a DELETE is 
sent](https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html).
   
   This avoids the file copying altogether.
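   
   The commit/abort semantics can be modelled roughly as follows. This is a 
toy in-memory sketch, not the Hadoop or S3 API: `InMemoryObjectStore` and 
`PendingUpload` are hypothetical names, and a real multipart upload sends the 
parts to S3 rather than buffering them locally.

```python
class InMemoryObjectStore:
    """Toy stand-in for a bucket: an object appears only once its upload completes."""

    def __init__(self):
        self.objects = {}

    def start_upload(self, key):
        return PendingUpload(self, key)


class PendingUpload:
    """Hypothetical model of a multipart upload behind an abortable stream."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._parts = []
        self._finished = False

    def write(self, part):
        if self._finished:
            raise ValueError("upload already completed or aborted")
        self._parts.append(part)

    def close(self):
        # Commit: the object becomes visible under its final key directly,
        # with no temporary object and no copy.
        if not self._finished:
            self._store.objects[self._key] = b"".join(self._parts)
            self._finished = True

    def abort(self):
        # Cancel: the uploaded parts are discarded and nothing becomes visible.
        if not self._finished:
            self._parts.clear()
            self._finished = True
```

   In this model `close()` plays the role of the completing POST and `abort()` 
the role of the DELETE; in both cases the final key is either fully written or 
never created.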
   
   
   ### Why are the changes needed?
   
   To improve streaming performance: committing a checkpoint file on S3 no 
longer needs a rename, and therefore no file copy.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   
   I have refactored the existing `CheckpointFileManagerTests` and run it against 
a test filesystem which supports the `Abortable` interface (see 
`AbortableFileSystem`, which is based on `RawLocalFileSystem`). 
   This way we have a unit test.
   
   Moreover the same test can be run against AWS S3 by using an integration 
test (see `AwsAbortableStreamBasedCheckpointFileManagerSuite`):
   
   ```
   -> S3_PATH=<..> AWS_ACCESS_KEY_ID=<..> AWS_SECRET_ACCESS_KEY=<..> \
      AWS_SESSION_TOKEN=<..> ./build/mvn install -pl hadoop-cloud \
      -Phadoop-cloud,hadoop-3,integration-test
   
   Discovery starting.
   Discovery completed in 346 milliseconds.
   Run starting. Expected test count is: 1
   AwsAbortableStreamBasedCheckpointFileManagerSuite:
   - mkdirs, list, createAtomic, open, delete, exists
   CommitterBindingSuite:
   AbortableStreamBasedCheckpointFileManagerSuite:
   Run completed in 14 seconds, 407 milliseconds.
   Total number of tests run: 1
   Suites: completed 4, aborted 0
   Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

