Re: [PR] [CELEBORN-894] End to End Integrity Checks [celeborn]

via GitHub Tue, 24 Jun 2025 22:36:38 -0700


buska88 commented on PR #3261:
URL: https://github.com/apache/celeborn/pull/3261#issuecomment-3002613073


   > > * we have apps with **100k-600k mappers in a single stage (and multiple
   > > such stages)** that have been running reliably and performantly
   >
   > Additionally, we still have concerns about high concurrency scenarios
   > (executor * cores). We have actually applied end-to-end consistency
   > validation in production scenarios, and recently we've been analyzing
   > cases of driver celeborn RPC timeouts (500k mappers, 8000*2 cores), which
   > may not necessarily be related to this change.
   
   We tested this job in several configurations.
   With spark.celeborn.client.shuffle.integrityCheck.enabled=false, a MapEnd
   request takes 0.1-0.2 ms.
   With the current PR, a MapEnd request takes 50-70 ms.
   We found that the bottleneck is finishMapperAttempt in
   ReducePartitionCommitHandler: the loop for (i <- 0 until numPartitions) {}
   is expensive when numPartitions and numMappers are large, and if it is
   placed inside shuffleMapperAttempts.synchronized, it leads to long MapEnd
   requests.
   @gauravkm @gaoyajun02 
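
   To make the locking concern concrete, here is a minimal sketch of the two placements of the per-partition loop. It is not the code in this PR: the class `MapEndSketch`, its fields, and both method names are hypothetical, and the per-partition byte totals only stand in for whatever the integrity check accumulates.

```scala
import java.util.concurrent.atomic.AtomicLongArray

// Hypothetical stand-in for the commit handler; not Celeborn's actual class.
final class MapEndSketch(numMappers: Int, numPartitions: Int) {
  // -1 means no attempt has finished yet for that mapper.
  private val finishedAttempt = Array.fill(numMappers)(-1)
  // Per-partition running totals (assumed shape of the integrity data).
  private val bytesPerPartition = new AtomicLongArray(numPartitions)

  // Variant A: the O(numPartitions) loop runs while the lock is held,
  // so concurrent MapEnd calls queue up behind each other.
  def finishInsideLock(mapId: Int, attemptId: Int, bytes: Array[Long]): Boolean =
    finishedAttempt.synchronized {
      if (finishedAttempt(mapId) >= 0) false
      else {
        finishedAttempt(mapId) = attemptId
        for (i <- 0 until numPartitions) bytesPerPartition.addAndGet(i, bytes(i))
        true
      }
    }

  // Variant B: the lock guards only the attempt bookkeeping; the
  // per-partition work happens afterwards on atomic counters.
  def finishOutsideLock(mapId: Int, attemptId: Int, bytes: Array[Long]): Boolean = {
    val first = finishedAttempt.synchronized {
      if (finishedAttempt(mapId) >= 0) false
      else { finishedAttempt(mapId) = attemptId; true }
    }
    if (first) {
      var i = 0
      while (i < numPartitions) {
        bytesPerPartition.addAndGet(i, bytes(i))
        i += 1
      }
    }
    first
  }

  def total(partition: Int): Long = bytesPerPartition.get(partition)
}
```

   Variant B shortens the critical section, but a reader may then see a mapper marked as finished before its totals are fully added, so it would only be safe if the totals are read after all mappers have completed.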


