EnricoMi opened a new issue, #1512:
URL: https://github.com/apache/incubator-uniffle/issues/1512

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What would you like to be improved?
   
   The blockId consists of a sequence number, the partition id, and the task attempt id. In Spark, the partition id is 31 bits (a positive int) and the task attempt id is 63 bits (a positive long). However, only a fraction of those bits fits into the 63-bit blockId, so by design Uniffle cannot support arbitrary Spark apps.
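
   To make the bit-budget problem concrete, here is a minimal sketch of packing three fields into a 63-bit long. The bit widths (18-bit sequence number, 24-bit partition id, 21-bit task attempt id) and the class/method names are illustrative assumptions, not Uniffle's actual code:

   ```java
   // Illustrative sketch: packing three fields into a 63-bit block id.
   // Bit widths are assumptions for illustration, not Uniffle's source.
   public class BlockIdSketch {
       static final int SEQ_BITS = 18;       // sequence number
       static final int PARTITION_BITS = 24; // partition id
       static final int TASK_BITS = 21;      // task attempt id
       // 18 + 24 + 21 = 63 bits, so the result is a non-negative long

       static long pack(long seqNo, long partitionId, long taskAttemptId) {
           if (seqNo >= (1L << SEQ_BITS)
                   || partitionId >= (1L << PARTITION_BITS)
                   || taskAttemptId >= (1L << TASK_BITS)) {
               throw new IllegalArgumentException("field overflows its bit range");
           }
           return (seqNo << (PARTITION_BITS + TASK_BITS))
                | (partitionId << TASK_BITS)
                | taskAttemptId;
       }

       public static void main(String[] args) {
           System.out.println(Long.toBinaryString(pack(1, 2, 3)));
           // A task attempt id of 2^21 or more cannot be represented:
           // pack(0, 0, 1L << 21) throws IllegalArgumentException
       }
   }
   ```

   Under such a layout, any Spark app whose task attempt ids exceed the reserved width overflows, regardless of how the widths are tuned.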
   
   Using Uniffle with large datasets (e.g. 100 TB) and long-running Spark apps with many stages, we face an overflow of these fields. This blocks us from migrating all our Spark workloads onto a Kubernetes cluster with Uniffle as the shuffle service.
   
   Related:
   - https://github.com/apache/incubator-uniffle/issues/1399
   - https://github.com/apache/incubator-uniffle/issues/1398
   - https://github.com/apache/incubator-uniffle/issues/731
   - https://github.com/apache/incubator-uniffle/issues/420
   - https://github.com/apache/incubator-uniffle/issues/134
   
   ### How should we improve?
   
   The block id should store the map index (source partition) instead of the task attempt id of the map task. The former is a number between 0 and the number of partitions minus 1, whereas the latter is an arbitrarily large 63-bit number (a monotonically increasing unique id).
   
   With each stage, the task attempt id grows by at least the number of partitions (unless AQE coalesces partitions). Long-running Spark apps with hundreds of thousands of partitions will see task attempt ids in the hundreds of millions or even billions (30 bits). Using the map index instead would bound those ids by the number of partitions (e.g. 20 bits), no matter how many stages and jobs the app runs.
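
   The arithmetic above can be checked directly. The partition and stage counts below are hypothetical figures chosen to match the scale described in this issue:

   ```java
   // Hypothetical comparison: task attempt ids grow with every stage,
   // while the map index stays bounded by the partition count.
   public class IdGrowth {
       public static void main(String[] args) {
           long numPartitions = 200_000; // hundreds of thousands of partitions
           long stages = 5_000;          // long-running app

           // Task attempt ids grow by at least numPartitions per stage:
           long maxTaskAttemptId = numPartitions * stages; // 1_000_000_000

           // The map index never exceeds the partition count:
           long maxMapIndex = numPartitions - 1;

           // Bits needed to represent each value:
           System.out.println(64 - Long.numberOfLeadingZeros(maxTaskAttemptId)); // 30
           System.out.println(64 - Long.numberOfLeadingZeros(maxMapIndex));      // 18
       }
   }
   ```

   At that scale the task attempt id needs 30 bits and keeps growing, while the map index stays at 18 bits for the lifetime of the app.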
   
   The task attempt id is part of the blockId because multiple task attempts may write blocks for the same map index; embedding it allows blocks to be retrieved from block ids alone. Moving to the map index, those block ids would become ambiguous.
   
   However, the `BufferSegment` that provides metadata for blocks contains both the `blockId` and the `taskAttemptId`. So even though replacing the task attempt id in the block id with the map index makes `blockId` alone ambiguous, `blockId` and `taskAttemptId` together still form a unique identifier.
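
   A small sketch of that disambiguation, assuming a hypothetical key type (the names `BlockKey` and the literal ids are illustrative, not Uniffle's API):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Sketch: with the map index inside the block id, two task attempts for
   // the same map index produce the same blockId, but the (blockId,
   // taskAttemptId) pair from the segment metadata remains unique.
   public class UniqueBlockKey {
       record BlockKey(long blockId, long taskAttemptId) {}

       public static void main(String[] args) {
           long blockId = 42L; // same seqNo/partition/mapIndex for both attempts
           Map<BlockKey, String> segments = new HashMap<>();
           segments.put(new BlockKey(blockId, 1001L), "first attempt");
           segments.put(new BlockKey(blockId, 1002L), "speculative attempt");
           System.out.println(segments.size()); // 2: both blocks are kept apart
       }
   }
   ```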
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

