Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
Taking a step back and analyzing the solution for the problem at hand.
There are three main issues with the proposal:
* It does not solve the problem in a general manner.
  * I gave the examples of zip and sample; it applies to any order-sensitive
    closure.
* It does not fix the issue when a child stage has one or more completed
  tasks.
  * Even if we assume it is a specific fix for repartition/coalesce, even
    there it does not solve the problem and can cause data loss.
* It causes a performance regression for the existing workarounds.
  * The common workaround for this issue is to checkpoint + action or do a
    local/global sort (I believe sql does the latter now?).
  * The proposal causes a performance regression for these existing
    workarounds.
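
To make the data loss point concrete, here is a toy model in plain Python. It is not Spark code: `round_robin`, `stable_round_robin` and the record values are made up for illustration, and it only assumes that a recomputed parent stage can see its input in a different order. It shows one child task keeping its output from the first attempt while another reads the retried output, and then how a local sort before the order-sensitive step avoids the mismatch.

```python
def round_robin(records, num_partitions):
    """Assign each record to a partition by arrival position (order sensitive)."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


records = ["a", "b", "c", "d"]

# First attempt of the order-sensitive stage.
first = round_robin(records, 2)

# Child task 0 completes and keeps what it fetched from the first attempt.
done_task0 = first[0]

# The parent is recomputed (e.g. after a fetch failure) and this time sees
# the same records in a different order.
retry = round_robin(["b", "a", "c", "d"], 2)

# Child task 1 runs against the retried output.
rerun_task1 = retry[1]

combined = done_task0 + rerun_task1
lost = sorted(set(records) - set(combined))
duplicated = sorted(r for r in set(combined) if combined.count(r) > 1)
print("lost:", lost, "duplicated:", duplicated)


def stable_round_robin(recs, num_partitions):
    """Workaround sketch: sort locally so the assignment ignores arrival order."""
    return round_robin(sorted(recs), num_partitions)


same = stable_round_robin(records, 2) == stable_round_robin(["b", "a", "c", "d"], 2)
print("sorted variant is stable across retries:", same)
```

The sort is exactly the performance cost mentioned above: it is paid on every run to protect against a retry that usually does not happen.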
The corner case where the proposal works is when:
a) order sensitive stage has finished and
b) no task in child stage has finished fetching its shuffle input.
This is a fairly narrow subset, which is why I don't believe the current
approach helps.
Having said that, if it is possible to enhance the approach, that would be
great!
This is a fairly nasty issue which hurts users, and typically people who
are aware of the problem tend to always pay a performance cost to avoid the
corner case.