CodingCat commented on PR #3569: URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3698442078
> > > > @zuston that's a good question, you can find a simple example in my test https://github.com/apache/celeborn/pull/3569/files#diff-744d5d0f093bff68eee23d0ed6ab4c4ae3b5fc438342c32865acb4629060ab3dR191-R233
> > > >
> > > > Basically, there is a delay before a ShuffleDependency is completely GCed, so multiple jobs may reuse the output from the same upstream stage.
> > >
> > > Thanks for your quick reply. @CodingCat Does this issue also occur in pure Spark SQL use cases?
> >
> > It depends. For the happy path in SQL, even though multiple jobs show up in the UI for a single query, they are essentially sequential or parallel stages that AQE "converts" into jobs, and they already capture the correct lineage, so we won't delete shuffle too early.
> >
> > However, premature deletion still happens in some cases. For example, suppose we have 3 stages, A -> B -> C. After B finishes, the shuffles generated by A can be deleted. Now C hits a fetch failure and has to restart B, which can no longer find the shuffle generated by A because it was deleted "too early". (This is a case for both SQL and non-SQL, though.)
>
> Thanks, fully understood. From the job's perspective, both stage retries and application reruns introduce significant overhead and cost if early shuffle deletion happens.

Yeah, that's true. This feature is kind of aggressive; we essentially handpicked the jobs to roll it out to internally (by analyzing their lineage).

I think I may need to give some guidance on how to use this feature. Rolling it out to all jobs in one shot is dangerous and may not get the benefit you expected. This feature is most likely to pay off if you roll it out only to jobs with huge shuffles. For PINS, we have X jobs which shuffle more than Y TBs; rolling it out to just those already saved us a huge bill. We also got the benefit of enabling some jobs we had never been able to run before: we have k-way join jobs which shuffle Z+ PBs, and without this feature we would need 100s of servers to exclusively serve such a job. Now that we delete shuffle in a more timely manner, we only need 10s of machines.

Additionally, fetch failure is the biggest enemy. Sometimes your server just doesn't deliver data, and with this feature you pay more on retry. So you need to ensure your servers are in good shape by tuning whatever you need, and you may also want to monitor fetch failure occurrences cluster-wide so you can take action when needed.
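The A -> B -> C premature-deletion scenario above can be sketched as a toy simulation. This is plain Python, not Spark or Celeborn code; the `ShuffleStore` class and all names are illustrative assumptions:

```python
# Toy model of premature shuffle deletion (illustrative only, not Spark/Celeborn code).
# Stages: A -> B -> C. Each stage reads its parent's shuffle output.

class FetchFailedError(Exception):
    """Stands in for a shuffle fetch failure."""

class ShuffleStore:
    """Holds shuffle output keyed by the stage that produced it."""
    def __init__(self):
        self.outputs = {}

    def write(self, stage):
        self.outputs[stage] = f"shuffle-data-from-{stage}"

    def read(self, stage):
        if stage not in self.outputs:
            raise FetchFailedError(f"shuffle from stage {stage} is gone")
        return self.outputs[stage]

    def delete(self, stage):
        self.outputs.pop(stage, None)

store = ShuffleStore()

# Stage A runs and writes its shuffle output.
store.write("A")
# Stage B consumes A's shuffle and writes its own.
_ = store.read("A")
store.write("B")

# Aggressive cleanup: once B has finished, A's shuffle is deleted "early".
store.delete("A")

# Stage C hits a fetch failure reading B's output (e.g. a lost server),
# so the scheduler must rerun B -- but B's input from A no longer exists.
try:
    _ = store.read("A")   # B's retry fails here
    retried = True
except FetchFailedError as e:
    retried = False
    print(e)  # shuffle from stage A is gone

print("B retry succeeded:", retried)
```

The point of the sketch is the ordering: deleting A's output is safe only as long as no downstream failure ever forces B to rerun, which is exactly the assumption that a fetch failure breaks.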
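The cluster-wide fetch-failure monitoring mentioned above could look something like the following minimal sketch. The event shape, worker names, and the 1% threshold are all assumptions for illustration, not Celeborn metrics:

```python
# Minimal sketch of cluster-wide fetch-failure monitoring.
# Event shape, names, and threshold are illustrative assumptions, not Celeborn metrics.
from collections import Counter

def fetch_failure_alerts(events, window_total, rate_threshold=0.01):
    """Count FetchFailed events per worker over a window and flag workers
    whose failure share of window_total fetches exceeds rate_threshold."""
    failures = Counter(e["worker"] for e in events if e["type"] == "FetchFailed")
    return {worker: count / window_total
            for worker, count in failures.items()
            if count / window_total > rate_threshold}

# Example window: w1 failed twice out of 100 fetches, w2 is healthy.
events = [
    {"worker": "w1", "type": "FetchFailed"},
    {"worker": "w1", "type": "FetchFailed"},
    {"worker": "w2", "type": "FetchOk"},
]
print(fetch_failure_alerts(events, window_total=100))  # {'w1': 0.02}
```

In practice the same idea would be driven off whatever metrics pipeline the cluster already has; the point is to catch misbehaving servers early, since with aggressive shuffle deletion each fetch failure now costs a stage rerun rather than a cheap re-fetch.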
