zuston commented on issue #1030: URL: https://github.com/apache/incubator-uniffle/issues/1030#issuecomment-1647107322
> cc @xianjingfeng @zuston Could we finish this issue together?

Yes, I will.

> Regarding upload to S3: As long as you use the Apache HDFS S3A adapter you can stream data to an object store. However, you can only append as long as you keep the stream open, and you can only do so from a single client. The [S3A filesystem implementation uses buffered multi-part uploads to stream a file to an object store](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#How_S3A_writes_data_to_S3). Streaming from multiple clients should be possible in principle, but the coordination overhead and the way Java streams are implemented make things tricky.

I hope append can be avoided in this design. And I think it's OK to store the same partition's data in multiple files in the object store, like this:

```
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/0.index
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/0.data
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/1.index
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/1.data
....
....
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/990.index
s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/990.data
```

Each flush of a partition by the shuffle server is written into one new file. This is guaranteed by the following rule:

1. A partition must be managed by a single shuffle server, because the `id` used as the file name prefix is only known to that shuffle server.

A reader can then get the `endId` (i.e. `s3a://xxxxxxxxxx/{app_id}/{shuffle_id}/{partition_id}/endId.data`) from the shuffle server, which means we don't need a `list` operation.

If I'm wrong, feel free to point it out.
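The layout above could be sketched roughly as follows. This is only an illustration of the idea, not Uniffle code: the class and method names (`PartitionFileLayout`, `dataPath`, `allDataPaths`) and the base-path value are hypothetical, and the `endId` is assumed to arrive from the shuffle server out of band.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed per-partition file layout (hypothetical names).
public class PartitionFileLayout {
    // e.g. "s3a://bucket/{app_id}/{shuffle_id}/{partition_id}" (illustrative).
    private final String basePath;

    public PartitionFileLayout(String basePath) {
        this.basePath = basePath;
    }

    // Each flush by the single owning shuffle server takes the next id,
    // producing a fresh <id>.data / <id>.index pair instead of appending
    // to an existing object.
    public String dataPath(int id) {
        return basePath + "/" + id + ".data";
    }

    public String indexPath(int id) {
        return basePath + "/" + id + ".index";
    }

    // A reader that knows endId (reported by the shuffle server) can
    // enumerate every data file directly, with no object-store list call.
    public List<String> allDataPaths(int endId) {
        List<String> paths = new ArrayList<>();
        for (int id = 0; id <= endId; id++) {
            paths.add(dataPath(id));
        }
        return paths;
    }
}
```

Because ids are assigned by the single server that owns the partition, they are dense (0..endId), which is what lets the reader reconstruct all paths from `endId` alone.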