CodingCat commented on PR #3569: URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3698617368
we have many jobs like rdd1 -> rdd2 -> rdd3 -> s3. rdd1 generates 500TB of shuffle data, and that data is held for 24 hours because rdd3 takes that many hours to be dumped to s3. We have no way to release the reference to the shuffle dependency object, since rdd1 -> rdd2 -> rdd3 is a reference chain and rdd3 is not released until the job finishes.

I don't think non-remote-shuffle Spark users care about this feature as much: a huge compute cluster can have tens of thousands of nodes, and tasks being spread everywhere simply amortize the disk pressure across a broad range of machines. Celeborn users do care about this, as we cannot have that many Celeborn machines due to cost constraints.
