zuston commented on issue #1030: URL: https://github.com/apache/incubator-uniffle/issues/1030#issuecomment-1647107322
> cc @xianjingfeng @zuston Could we finish this issue together?

Yes, I will.

> Regarding upload to S3: As long as you use the Apache HDFS S3A adapter you can stream data to an object store. However, you can only append as long as you keep the stream open, and you can only do so from a single client. The [S3A filesystem implementation uses buffered multi-part uploads to stream a file to an object store](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#How_S3A_writes_data_to_S3). Streaming from multiple clients should be possible in principle, but the coordination overhead and the way Java streams are implemented make things tricky.

I hope append can be avoided in this design. And I think it's OK to store the same partition's data in multiple files in the object store, like this:

```
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/0.index
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/0.data
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/1.index
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/1.data
....
....
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/990.index
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/990.data
```

Each flush of a partition by the shuffle server is written into one new file. This is guaranteed by the following rule:

1. A partition must be managed by a single shuffle server, because the `id` used as the file name prefix is only known to that shuffle server.

A reader can then get the `endId` (i.e. `s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/endId.data`) from the shuffle server, which means we don't need a `list` operation.

If I'm wrong, feel free to point it out.
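The layout above could be sketched roughly as follows. This is only an illustration of the idea, not Uniffle code: the class and method names (`PartitionFileLayout`, `dataPath`, `allDataPaths`) and the base-path value are hypothetical, and the `endId` is assumed to arrive from the shuffle server out of band.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed per-partition file layout (hypothetical names).
public class PartitionFileLayout {
    // e.g. "s3a://bucket/{app_id}/{shuffle_id}/{partition_id}" (illustrative).
    private final String basePath;

    public PartitionFileLayout(String basePath) {
        this.basePath = basePath;
    }

    // Each flush by the single owning shuffle server takes the next id,
    // producing a fresh <id>.data / <id>.index pair instead of appending
    // to an existing object.
    public String dataPath(int id) {
        return basePath + "/" + id + ".data";
    }

    public String indexPath(int id) {
        return basePath + "/" + id + ".index";
    }

    // A reader that knows endId (reported by the shuffle server) can
    // enumerate every data file directly, with no object-store list call.
    public List<String> allDataPaths(int endId) {
        List<String> paths = new ArrayList<>();
        for (int id = 0; id <= endId; id++) {
            paths.add(dataPath(id));
        }
        return paths;
    }
}
```

Because ids are assigned by the single server that owns the partition, they are dense (0..endId), which is what lets the reader reconstruct all paths from `endId` alone.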