EnricoMi opened a new issue, #1512: URL: https://github.com/apache/incubator-uniffle/issues/1512
### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) ### Search before asking - [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues. ### What would you like to be improved? The blockId consists of a sequence number, the partition id and the task attempt id. In Spark, partition id is 31 bits (positive int), the attempt id is 63 bit (positive long). However, as part of the 63 bit blockId, it has only a fraction of those bits available. By design, Uniffle does not support any arbitrary Spark app. Using Uniffle with large datasets (e.g. 100TB) and long running Spark apps with many stages, we face an overflow of these fields. This blocks us from migrating all our Spark workloads onto a Kubernetes cluster with Uniffle shuffle service. Related: - https://github.com/apache/incubator-uniffle/issues/1399 - https://github.com/apache/incubator-uniffle/issues/1398 - https://github.com/apache/incubator-uniffle/issues/731 - https://github.com/apache/incubator-uniffle/issues/420 - https://github.com/apache/incubator-uniffle/issues/134 ### How should we improve? The block id should store the map index (source partition), instead of the task attempt id of the map task. The former is a number between 0 and the number of partitions - 1, whereas the latter is an arbitrary large 63 bit number (monotonically increasing unique id). With each stage, the task attempt id grows by at least the number of partitions (unless AQE coalesces partitions). Long running Spark apps with hundred of thousands of partitions will experience task attempt ids in the 100 of million or even billion (30 bit). Using the map index would limit those ids to the number of partitions (e.g. 20 bit), no matter how many stages and jobs there will be. The task attempt id is part of the blockId, because multiple task attempts may write blocks for the same map index. Having the block id contain the task attempt id allows to retrieve blocks from just block ids alone. Moving over to the map index, those block ids would become ambiguous. However, the `BufferSegment` that provides metadata for blocks contain the `blockId` and the `taskAttemptId`. Replacing the task attempt id in the block id with the map index would make `blockId` ambiguous. Together, `blockId` and `taskAttemptId` form a unique identifier. ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
