Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    Thank you for your comments @mridulm !
    We focus on resolving the RDD.repartition() correctness issue in this PR 
because it is the most commonly used operation, and we can still address the 
RDD.zip* issue later using a similar approach. At first I was worried that the 
changes might be huge, and that trying to address the correctness issue for 
multiple operations would make the PR difficult to backport. But it turns out 
the PR doesn't change that much code, so I may consider including the RDD.zip* 
fix in this PR too.
    
    Since you are also deeply involved in the related discussion of the 
correctness issue caused by non-deterministic shuffle input, you may agree 
that there is actually no easy way to guarantee correctness without a serious 
performance drop-off. I have to insist that correctness always takes 
precedence over performance concerns, and that we should not expect users to 
always remember that certain usage patterns may hit a known correctness bug.
    
    As for the proposed solution, there are actually two ways to go: either 
you insert a local sort before a shuffle repartition (that's how we fixed 
DataFrame.repartition()), or you always retry the whole stage containing the 
repartition on FetchFailure. The problem with the local-sort solution is that 
it can't fix all the problems for RDDs (the data type of an RDD may not be 
sortable, and it's hard to construct an ordering for a generic type), and it 
can increase the time consumption of repartition() by 3X ~ 5X. With the 
approach proposed in this PR, performance stays the same when no FetchFailure 
happens, and it works well for DataFrames as well as for RDDs.
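
    To make the underlying problem concrete, here is a minimal sketch (plain 
Python, not actual Spark code; the function name and data are made up for 
illustration) of why round-robin repartitioning is order-sensitive: records 
are assigned to output partitions by position, so a retried task that reads 
its input in a different order produces different partition contents, while a 
local sort first makes the assignment deterministic:

    ```python
    # Illustration only: a toy round-robin partitioner, mimicking how a
    # repartition distributes records to output partitions by position.
    def round_robin(records, num_partitions):
        parts = [[] for _ in range(num_partitions)]
        for i, rec in enumerate(records):
            parts[i % num_partitions].append(rec)
        return parts

    data = [3, 1, 2, 5, 4, 6]
    first_run = round_robin(data, 2)

    # A retried task may see the same set of records in a different order
    # (non-deterministic shuffle input), yielding different partitions:
    rerun = round_robin([1, 3, 2, 4, 5, 6], 2)
    assert first_run != rerun  # same records, different partition contents

    # Inserting a local sort before partitioning makes the result
    # order-independent (the DataFrame.repartition() style of fix):
    assert round_robin(sorted(data), 2) == round_robin(sorted([1, 3, 2, 4, 5, 6]), 2)
    ```

    This is why, when a FetchFailure loses some map output, the already-committed 
sibling partitions can no longer be trusted and the whole stage has to be 
retried.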
    
    I have to admit that if you have a big query running on a huge cluster, 
and the tasks can easily hit FetchFailures, then the job may take more time to 
finish (or even fail by reaching the max consecutive stage failure limit). But 
again, without a patch your big query may be producing wrong results, and I 
have to say that is even more unacceptable :( . As for the cascading cost, you 
are right, it makes things worse, and I don't have good advice for that issue.

