CodingCat commented on PR #3569:
URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3698442078

   > > > > @zuston that's a good question, you can find a simple example in my 
test 
https://github.com/apache/celeborn/pull/3569/files#diff-744d5d0f093bff68eee23d0ed6ab4c4ae3b5fc438342c32865acb4629060ab3dR191-R233
   > > > > basically there is a delay for ShuffleDependency to completely be 
GCed, so multiple jobs may reuse the output from the same upstream stage.
   > > > 
   > > > 
   > > > Thanks for your quick reply. @CodingCat Does this issue also occur in 
pure Spark SQL use cases?
   > > 
   > > 
   > > it depends....
   > > for the happy path... in SQL, even if there are multiple jobs shown in the UI for a single query, they are essentially either sequential or parallel stages which are "converted" to jobs by AQE... and they already capture the correct lineage, so we won't delete shuffle too early
   > > however, the premature deletion still happens in some cases, e.g. if we have 3 stages, A -> B -> C:
   > > after B is finished, shuffles generated by A can be deleted; now if C hits a fetch failure, it has to restart B, which cannot find the shuffle generated by A anymore as it was deleted "too early" (this is a case for both SQL and non-SQL, though)
   > 
   > Thanks, fully understood. From the job’s perspective, both stage retries and application reruns introduce significant overhead and cost if early shuffle deletion happens.
   
   yeah, that's true, this feature is kind of aggressive; we essentially handpicked jobs for the internal rollout (by analyzing their lineage)
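   To make the premature-deletion risk concrete, here is a toy Python model (not Celeborn code; the stage names and the `stages_to_rerun` helper are made up for illustration) of the A -> B -> C scenario quoted above: once A's shuffle output is deleted eagerly, a fetch failure while reading B's output cascades into re-running A as well.

```python
def stages_to_rerun(parent, failed_fetch_from, deleted):
    """Stages that must be re-executed after a fetch failure.

    parent: maps each stage to its upstream stage (None for a root stage).
    failed_fetch_from: the stage whose shuffle output could not be fetched.
    deleted: stages whose shuffle output was already deleted.
    """
    rerun = [failed_fetch_from]        # the failed stage must regenerate its output
    stage = parent[failed_fetch_from]  # ...which needs its parent's shuffle as input
    while stage is not None and stage in deleted:
        rerun.append(stage)            # parent's output is gone, so it reruns too
        stage = parent[stage]
    return rerun

parent = {"C": "B", "B": "A", "A": None}

# Conservative deletion: A's output is kept until the job ends -> only B reruns.
print(stages_to_rerun(parent, "B", deleted=set()))    # ['B']

# Eager deletion: A's output was removed once B finished -> the rerun cascades.
print(stages_to_rerun(parent, "B", deleted={"A"}))    # ['B', 'A']
```

   The deeper the lineage and the hotter the deletion policy, the longer that cascade gets, which is why the retry cost discussed below matters so much.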
   
   I think I may need to give some guidance on how to use this feature. Rolling it out to all jobs in one shot is dangerous and may not get you the benefit you expected... this feature is most likely to pay off if you only roll it out to those huge-shuffle jobs. For PINS, we have X jobs which shuffle more than Y TBs; only rolling it out to them already saved us a huge bill..... we also got the benefit of being able to enable some jobs which we had never been able to run.... e.g. we have k-way join jobs which shuffle Z+ PBs; without this feature, we would need 100s of servers to exclusively serve such a job.... now, as we delete shuffle data in a more timely manner, we only need 10s of machines ...
   
   additionally, fetch failure is the biggest enemy. sometimes, your server just doesn't deliver the data... then, with this feature, you pay more for retries... so you need to ensure that your servers are in good shape by tuning whatever you need... and you may also want to monitor fetch-failure occurrences cluster-wide so you can take action when needed
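   The monitoring idea above can be sketched as a minimal sliding-window counter (a hypothetical standalone helper, not a Spark or Celeborn API; you would feed it fetch-failure events from wherever you collect task metrics) that flags when cluster-wide fetch failures exceed a threshold within a time window:

```python
from collections import deque

class FetchFailureMonitor:
    """Alert when too many fetch failures land inside a sliding time window."""

    def __init__(self, window_secs=300, threshold=10):
        self.window_secs = window_secs
        self.threshold = threshold
        self.events = deque()  # timestamps of observed fetch failures

    def record(self, now):
        """Record one fetch failure; return True when the windowed count
        reaches the threshold and someone should take action."""
        self.events.append(now)
        # drop failures that fell out of the window
        while self.events and self.events[0] <= now - self.window_secs:
            self.events.popleft()
        return len(self.events) >= self.threshold

mon = FetchFailureMonitor(window_secs=60, threshold=3)
print(mon.record(0))    # False
print(mon.record(10))   # False
print(mon.record(20))   # True: 3 failures within 60s
print(mon.record(100))  # False: the earlier failures aged out
```

   Whether the "action" is pausing the feature for the affected jobs or investigating the unhealthy servers is a policy decision, but having the signal cluster-wide is the prerequisite either way.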
   