CodingCat commented on PR #3569: URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3698419099
> > @zuston that's a good question, you can find a simple example in my test https://github.com/apache/celeborn/pull/3569/files#diff-744d5d0f093bff68eee23d0ed6ab4c4ae3b5fc438342c32865acb4629060ab3dR191-R233
> >
> > Basically there is a delay before a ShuffleDependency is completely GCed, so multiple jobs may reuse the output of the same upstream stage.
>
> Thanks for your quick reply. @CodingCat Does this issue also occur in pure Spark SQL use cases?

It depends. On the happy path in SQL, even when a single query shows multiple jobs in the UI, those are essentially sequential or parallel stages that AQE "converts" into jobs. They already capture the correct lineage, so we won't delete shuffle data too early.

However, premature deletion still happens in some cases. For example, with three stages A -> B -> C: after B finishes, the shuffle generated by A can be deleted. If C then hits a fetch failure, it has to rerun B, which can no longer find the shuffle generated by A, as it was deleted "too early". (This case applies to both SQL and non-SQL workloads.)
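To make the fetch-failure case concrete, here is a toy sketch (plain Python, not Spark or Celeborn code; the stage names, `deps` map, and eager-cleanup rule are all illustrative assumptions) of why deleting A's shuffle as soon as B finishes breaks recovery:

```python
# Toy model of the A -> B -> C scenario. Illustrative only: real
# cleanup is driven by ShuffleDependency GC, not this eager rule.

# Shuffle data currently available, keyed by the producing stage.
shuffles = {"A": "A-output", "B": "B-output"}

# Stage dependencies: consumer -> list of upstream producers.
deps = {"B": ["A"], "C": ["B"]}

def finish_stage(stage):
    # Naive eager cleanup: once a consumer finishes, drop the
    # shuffle output of every stage it consumed.
    for upstream in deps.get(stage, []):
        shuffles.pop(upstream, None)

def missing_inputs(stage):
    # Shuffle inputs a stage would need in order to (re)run.
    return [d for d in deps.get(stage, []) if d not in shuffles]

finish_stage("B")                     # B done => A's shuffle deleted eagerly
assert "A" not in shuffles

# C hits a fetch failure and must rerun B, but B's input is gone:
assert missing_inputs("B") == ["A"]   # recovery blocked: deleted "too early"
```

The fix direction discussed in the PR is essentially to not tie shuffle lifetime to a single downstream stage completing, since a later fetch failure can resurrect the need for that upstream output.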
