GitHub user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22112
  
    Catching up on discussion ...
    
    @cloud-fan
    > shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified.
    
    Let me rephrase that: key ordering *with an aggregator* specified.
    Unfortunately, that means it is applicable only to custom user code, since the default Spark APIs do not set both (see the sketch below).
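    To make that concrete, here is a minimal sketch of what such custom code would look like, using the developer-level `ShuffledRDD` API (local mode, summing values per key; the job itself is illustrative only):

    ```scala
    import org.apache.spark.{Aggregator, HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.ShuffledRDD

    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[4]"))
    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

    // Sum values per key: createCombiner, mergeValue, mergeCombiners.
    val agg = new Aggregator[Int, Int, Int](v => v, _ + _, _ + _)

    // Only with *both* a key ordering and an aggregator set is the shuffle
    // output repeatable across retries; default APIs such as repartition or
    // groupByKey do not set both together.
    val repeatable = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(4))
      .setKeyOrdering(Ordering.Int)
      .setAggregator(agg)
      .setMapSideCombine(true)
    ```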
    
    > The reduce task fetches multiple remote shuffle blocks at the same time, so the order is always random.
    
    This is not a characteristic of shuffle in MR-based systems, but an implementation detail of shuffle in Spark.
    In Hadoop MapReduce, for example, shuffle output is always ordered and this problem does not occur (a short illustration of the Spark behavior follows).
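    For illustration, a sketch of a pipeline where the fetch order becomes visible (continuing with the `sc` above; the specific job is just an example):

    ```scala
    // repartition distributes records round-robin, and the receiving reduce
    // task interleaves whichever remote blocks arrive first, so the order of
    // records inside each output partition can differ between a run and a
    // stage retry.
    val data = sc.parallelize(1 to 100000, 10)
    val reshuffled = data.repartition(5)

    // Anything derived from that order, such as these indices, is therefore
    // not repeatable when a lost map output forces a recompute.
    val indexed = reshuffled.zipWithIndex()
    ```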
    
    >  In Addition, Spark SQL never specifies key ordering.
    
    Spark SQL has re-implemented a lot of the Spark core primitives. Given this, I would expect Spark SQL to:
    * introduce a local sort where appropriate when an RDD view is generated off a DataFrame, as has already been done in SPARK-23207 for the repartition case; and/or
    * appropriately expose IDEMPOTENT, UNORDERED and INDETERMINATE in the RDD view (a hypothetical sketch follows this list).
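    Purely as a sketch of that second bullet; the enum and method names below are placeholders taken from this discussion, not an existing Spark API:

    ```scala
    // Placeholder names only, not a merged API.
    object DeterminismLevel extends Enumeration {
      // IDEMPOTENT: rerun produces the same output in the same order.
      // UNORDERED: rerun produces the same output set, but order may differ.
      // INDETERMINATE: rerun may produce different output (e.g. random in closure).
      val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
    }

    trait DeterminismAware {
      // The RDD view generated from a DataFrame would declare its level, so
      // the scheduler knows whether a partial retry can yield inconsistent data.
      def outputDeterminismLevel: DeterminismLevel.Value
    }
    ```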
    
    @tgravescs 
    > I don't agree that " We actually cannot support random output". Users can do this now in MR and spark and we can't really stop them other then say we don't support and if you do failure handling will cause different results.
    
    What I mentioned was not specific to Spark, but general to any MR-like system.
    This applies even in Hadoop MapReduce, and used to be a bug in some of our Pig UDFs :-)
    For example, if random output is generated in the mapper and there are node failures during the reduce phase (after all mappers have completed), the exact same problem occurs (sketch below).
    We cannot, of course, stop users from doing it - but we do not guarantee correct results (just as Hadoop MapReduce does not in this scenario).
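    A minimal sketch of that failure mode, written as Spark code for concreteness (the job itself is arbitrary):

    ```scala
    import scala.util.Random

    // Each map task emits random keys. If an executor is lost after all map
    // tasks have finished, the rerun map task emits *different* values, so
    // reduce tasks that re-fetch its output see data inconsistent with the
    // reduce tasks that completed before the failure.
    val randomized = sc.parallelize(1 to 1000).map(i => (Random.nextInt(10), i))
    val counts = randomized.groupByKey().mapValues(_.size)
    counts.collect()  // correctness is not guaranteed under retries
    ```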
    
    > I don't want us to document it away now and then change our mind in next release. Our end decision should be final.
    
    My current thought is as follows:
    
    Without making shuffle output order repeatable, we do not have a way to properly fix this.
    My understanding from @jiangxb1987, who has looked at it in detail with @sameeragarwal and others, is that this is a very difficult invariant to achieve for shuffle in general in the current Spark codebase.
    (Please holler if I am off base @jiangxb1987 !)
    
    With the assumption that we cannot currently fix this, explicitly warning the user and/or rescheduling all tasks/stages for correctness might be a good stopgap.
    Users could mitigate the performance impact via checkpointing [1] (see the sketch below) - I would expect this to be the go-to solution. For any non-trivial job, the perf characteristics and SLA violations are going to be terrible when failures occur after this patch is applied, but we should not have any data loss.
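    For concreteness, a sketch of the checkpoint mitigation (the directory is just an example path):

    ```scala
    // Checkpointing materializes the shuffled data to reliable storage and
    // truncates the lineage, so a later failure re-reads the checkpoint
    // instead of re-running the non-repeatable shuffle.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // example path
    val shuffled = sc.parallelize(1 to 1000).map(i => (i % 10, i)).repartition(8)
    shuffled.checkpoint()
    shuffled.count()  // an action forces computation and writes the checkpoint
    ```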
    
    In the future, we might resolve this issue in a more principled manner.
    
    [1] As @cloud-fan pointed out [here](https://github.com/apache/spark/pull/22112#issuecomment-414034703), a sort is not guaranteed to work unless keys are unique, since ordering is defined only on the key and not the value (and so values can be reordered - tiny example below).
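    A tiny example of that caveat:

    ```scala
    // Ordering constrains keys only: both (1,"a"),(1,"b") and (1,"b"),(1,"a")
    // are valid sorted outputs, so a retry can still reorder values under
    // duplicate keys even after sortByKey.
    val dup = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
    val sorted = dup.sortByKey()
    ```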
    
    @cloud-fan 
    > This is the problem we are resolving here. This assumption is incorrect, and the RDD closure should handle it, or use what I proposed in this PR: the retry strategy.
    
    I would disagree with this - it is an artifact of an implementation detail of Spark's shuffle, and is not the expected behavior for an MR-based system.
    Unfortunately, this has been the behavior since the beginning (at least since 0.6).
    IMO this was not a conscious design choice, but rather an oversight.
    
    > IIRC @mridulm didn't agree with it. One problem is that, it's hard for users to realize that Spark returns wrong result, so they don't know when to handle it.
    
    Actually, I would expect users to end up doing one of those two - the perf characteristics and lack of SLA predictability after this patch are going to force users to choose one of them.
    


