Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    I think everything we've discussed here comes down to promises and 
expectations, and your zip example doesn't seem to be an issue: Spark 
explicitly does not guarantee the output order of an unsorted RDD (a task 
may see a different order when it gets retried), so users should expect your 
zip example to return random values. The first thing we need to agree on is 
that this is not a general issue; it's only about `repartition`. I think it's 
pretty clear that users will not expect `rdd.repartition(x).count` to return 
a different result from `rdd.count`. cc @JoshRosen @zsxwing 
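
    To make the distinction concrete, here is a minimal sketch (a 
hypothetical repro, not code from this patch; the two counts only diverge 
when a task in the repartition stage is retried after a failure, e.g. a 
fetch failure):

    ```scala
    import org.apache.spark.SparkContext

    // Hypothetical repro sketch: sizes and partition counts are illustrative.
    val sc = new SparkContext("local[4]", "repartition-repro")
    val rdd = sc.parallelize(1 to 1000000, 10)

    // Per the promise above, zipping unsorted RDDs may pair elements
    // differently across runs -- that is expected behavior.
    val zipped = rdd.zip(rdd.map(_ * 2))

    // But users do expect these two counts to always agree. With round-robin
    // repartition, a retried map task may distribute records differently, so
    // reducers that already finished can double-count or drop records.
    println(rdd.count())
    println(rdd.repartition(7).count())
    ```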
    
    > It does not fix the issue when a child stage has one or more completed 
tasks.
    
    The approach here is to recompute all the partitions of the child stage; 
the completed tasks will be ignored. Am I missing something here? If we 
can't fix the issue for `repartition`, this patch must be rejected. 
@jiangxb1987 please explain this fix in more detail and make sure it works.
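
    For clarity, the recompute-everything idea can be sketched like this 
(illustrative only; this is not the actual DAGScheduler code, and `Stage` 
here is a stand-in type):

    ```scala
    // Illustrative stand-in type, not a Spark internal.
    case class Stage(id: Int, numPartitions: Int, deterministicOutput: Boolean)

    // When a parent stage is rerun, decide which child partitions to
    // recompute. If the parent's output is non-deterministic (e.g.
    // round-robin repartition), previously completed child tasks are ignored
    // and all partitions rerun, because their inputs may have changed.
    def partitionsToRecompute(rerunParent: Stage, child: Stage,
                              completedChildPartitions: Set[Int]): Set[Int] = {
      val all = (0 until child.numPartitions).toSet
      if (rerunParent.deterministicOutput) all -- completedChildPartitions
      else all
    }
    ```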
    
    > It causes performance regression to existing workaround.
    
    The perf regression only happens when users call `repartition`, have 
more shuffles after the `repartition`, and some `repartition` tasks fail. 
That said, users only pay the perf penalty in exactly the cases where they 
would otherwise get a wrong answer.
    
    > The common workaround for this issue is to checkpoint + action or do a 
local/global sort
    
    Do you have a reference for that common workaround? It would be better 
to follow a common solution.
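
    For reference, my understanding of those workarounds looks roughly like 
this (a sketch of the workaround being asked about, not something this patch 
adds; it assumes a `SparkContext` `sc`, and the checkpoint path is 
hypothetical):

    ```scala
    val data = sc.parallelize(1 to 1000000, 10)

    // Workaround 1: checkpoint + action. Retried downstream tasks re-read
    // the materialized checkpoint instead of recomputing the repartition.
    sc.setCheckpointDir("/tmp/ckpt")   // hypothetical path
    val repartitioned = data.repartition(7)
    repartitioned.checkpoint()
    repartitioned.count()              // action that materializes the checkpoint

    // Workaround 2: a global sort makes the input order deterministic, so a
    // retried round-robin repartition task places records the same way again.
    val stable = data.sortBy(identity).repartition(7)
    ```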

