CodingCat commented on PR #3569:
URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3698419099

   > > @zuston that's a good question, you can find a simple example in my test 
https://github.com/apache/celeborn/pull/3569/files#diff-744d5d0f093bff68eee23d0ed6ab4c4ae3b5fc438342c32865acb4629060ab3dR191-R233
   > > basically there is a delay before a ShuffleDependency is completely GCed, 
so multiple jobs may reuse the output from the same upstream stage.
   > 
   > Thanks for your quick reply. @CodingCat Does this issue also occur in pure 
Spark SQL use cases?
   
   It depends...
   
   For the happy path in SQL: even though multiple jobs show up in the UI for a 
single query, they are essentially sequential or parallel stages that AQE 
"converts" into jobs. Those jobs already capture the correct lineage, so we 
won't delete the shuffle too early.
   
   
   However, premature deletion can still happen in some cases. For example, with 
3 stages A -> B -> C:
   
   After B finishes, the shuffle generated by A can be deleted. If C then hits a 
fetch failure, it has to restart B, which can no longer find the shuffle 
generated by A because it was deleted "too early". (This case applies to both 
SQL and non-SQL workloads.)
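
   A toy sketch of that failure mode (plain Scala, not Celeborn's or Spark's 
actual API; all names here are made up for illustration):

```scala
// Simulates the A -> B -> C scenario: each stage can only (re)run if its
// upstream shuffle output is still available.
object PrematureDeletionDemo {
  // shuffleId -> still available?
  val shuffles = scala.collection.mutable.Map[String, Boolean]()

  // Returns true if the stage could run (i.e. its upstream shuffle exists),
  // and registers the stage's own shuffle output on success.
  def runStage(name: String, upstream: Option[String]): Boolean = {
    val ok = upstream.forall(id => shuffles.getOrElse(id, false))
    if (ok) shuffles(name) = true
    ok
  }

  def main(args: Array[String]): Unit = {
    runStage("A", None)        // stage A writes shuffle A
    runStage("B", Some("A"))   // stage B reads A, writes shuffle B
    shuffles("A") = false      // A's shuffle is deleted "too early" after B finishes
    // C hits a fetch failure and must rerun B, but A's shuffle is gone:
    val rerunB = runStage("B", Some("A"))
    println(s"rerun of B succeeded: $rerunB") // false
  }
}
```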

