Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
Catching up on discussion ...
@cloud-fan
> shuffled RDD will never be deterministic unless the shuffle key is the
entire record and key ordering is specified.
Let me rephrase that - key ordering with an aggregator specified.
Unfortunately this then means it is applicable only to custom user code - since the default Spark APIs do not set both.
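For illustration, here is a rough sketch (using the low-level `ShuffledRDD` developer API, assuming a recent 2.x Spark; the helper name is made up) of the combination being discussed - both a key ordering and an aggregator set on the same shuffle. The common pair-RDD operations set only one of the two: `reduceByKey` sets an aggregator but no key ordering, `sortByKey` a key ordering but no aggregator.

```scala
import org.apache.spark.{Aggregator, HashPartitioner, SparkContext}
import org.apache.spark.rdd.ShuffledRDD

// Sketch only: a shuffle with BOTH a key ordering and an aggregator specified,
// which is the combination under discussion. `sc` is an existing SparkContext.
def orderedAggregatedShuffle(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // Simple sum aggregator (commutative/associative, so merge order does not matter).
  val agg = new Aggregator[String, Int, Int](v => v, _ + _, _ + _)

  val shuffled = new ShuffledRDD[String, Int, Int](pairs, new HashPartitioner(4))
    .setKeyOrdering(Ordering[String])   // reduce-side sort by key
    .setAggregator(agg)                 // per-key combine
    .setMapSideCombine(true)

  shuffled.collect().foreach(println)
}
```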
> The reduce task fetches multiple remote shuffle blocks at the same time,
so the order is always random.
This is not a characteristic of shuffle in MR-based systems, but an implementation detail of shuffle in Spark.
In Hadoop MapReduce, for example, the shuffle output delivered to a reducer is always sorted by key, and this problem does not occur.
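As a concrete (made-up) illustration of what that implementation detail means for users: any operation that depends on within-partition order after an unordered shuffle, such as `zipWithIndex` after `repartition`, can pair values with different indices if the stage is recomputed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleOrderExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-order").setMaster("local[4]"))

    val indexed = sc.parallelize(1 to 1000, numSlices = 10)
      .repartition(5)   // shuffle with no key ordering: block arrival order decides layout
      .zipWithIndex()   // (value, index) pairs therefore depend on that arrival order

    indexed.take(5).foreach(println)
    sc.stop()
  }
}
```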
> In Addition, Spark SQL never specifies key ordering.
Spark SQL has re-implemented a lot of the Spark core primitives - given this, I would expect Spark SQL to:
* when an RDD view is generated off a DataFrame, introduce a local sort where appropriate - as has already been done in SPARK-23207 for the repartition case, and/or
* appropriately expose IDEMPOTENT, UNORDERED and INDETERMINATE in the RDD view (a rough sketch follows below).
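To make the second bullet concrete, here is a rough sketch of what exposing such a level on the RDD view could look like. The level names are the ones used in this discussion, not necessarily the final API, and the object/trait names here are purely hypothetical.

```scala
// Hypothetical names, for illustration only.
object OutputDeterminism extends Enumeration {
  // IDEMPOTENT:    same partitions, same contents, same order on recompute.
  // UNORDERED:     same contents per partition, but order may differ.
  // INDETERMINATE: contents themselves may differ on recompute.
  val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
}

// An RDD view produced from a SQL plan could advertise its level, so the
// scheduler knows whether a partial retry of a downstream stage is safe.
trait DeterminismAware {
  def outputDeterminism: OutputDeterminism.Value = OutputDeterminism.IDEMPOTENT
}
```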
@tgravescs
> I don't agree that " We actually cannot support random output". Users can
do this now in MR and spark and we can't really stop them other then say we
don't support and if you do failure handling will cause different results.
What I mentioned was not specific to Spark, but general to any MR-like system.
This applies even in Hadoop MapReduce and used to be a bug in some of our Pig UDFs :-)
For example, if random output is generated in the mapper and there are node failures during the reducer phase (after all mappers have completed), the exact same problem occurs: the recomputed mapper produces different output.
We cannot, of course, stop users from doing it - but we do not guarantee correct results (just as Hadoop MapReduce does not in this scenario).
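A minimal (made-up) reproduction of that Pig-style failure mode in Spark terms - the map side emits random keys, so a map task that is recomputed after a node loss feeds the reducers different records than its original run did:

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object RandomMapOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("random-map-output").setMaster("local[4]"))

    val counts = sc.parallelize(1 to 100000, numSlices = 20)
      .map(_ => (Random.nextInt(10), 1)) // non-deterministic "mapper" output
      .reduceByKey(_ + _)                // recomputing a lost map output skews these counts

    counts.collect().sorted.foreach(println)
    sc.stop()
  }
}
```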
> I don't want us to document it away now and then change our mind in next
release. Our end decision should be final.
My current thought is as follows:
Without making shuffle output order repeatable, we do not have a way to
properly fix this.
My understanding from @jiangxb1987, who has looked at it in detail with @sameeragarwal and others, is that this is a very difficult invariant to achieve in the current Spark codebase for shuffle in general.
(Please holler if I am off base @jiangxb1987 !)
With the assumption that we cannot currently fix this - explicitly warning the user and/or rescheduling all tasks/stages for correctness might be a good stopgap.
Users could mitigate the performance impact via checkpointing [1] - I would expect this to be the go-to solution; for any non-trivial job, the perf characteristics and SLA violations are going to be terrible when failures occur after this patch is applied, but we should not have any data loss.
In the future, we might resolve this issue in a more principled manner.
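A sketch of that checkpoint-based mitigation (paths, input and shapes are illustrative, not from this PR): materialising the shuffled RDD to reliable storage truncates the lineage, so a later node loss re-reads the checkpointed data instead of re-running the order-sensitive shuffle.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointMitigation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-mitigation"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // assumed path

    val shuffled = sc.textFile("hdfs:///data/input")     // assumed input
      .map(line => (line.hashCode % 100, line))
      .groupByKey(50)

    shuffled.checkpoint() // materialised the first time the RDD is computed
    shuffled.count()      // force the checkpoint before downstream stages run

    // Downstream stages now read the checkpointed data on retry instead of
    // re-executing the (order-sensitive) shuffle above.
    sc.stop()
  }
}
```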
[1] As @cloud-fan pointed out [here](https://github.com/apache/spark/pull/22112#issuecomment-414034703), sort is not guaranteed to work - unless keys are unique: ordering is defined only on the key and not the value (and so values can be re-ordered).
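To illustrate footnote [1] with a made-up example: `sortByKey` orders records by key only, so with duplicate keys the relative order of their values is whatever order the shuffle delivered them in, and that can legitimately differ between the original run and a recompute.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// `sc` is an existing SparkContext; duplicate keys make the sort non-unique.
def sortIsNotEnough(sc: SparkContext): RDD[(Int, String)] = {
  val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (1, "c"), (2, "d")))
  // One run may yield (1,a),(1,b),(1,c),(2,d); a recompute could just as
  // validly yield (1,b),(1,a),(1,c),(2,d) - both satisfy the key ordering.
  pairs.sortByKey()
}
```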
@cloud-fan
> This is the problem we are resolving here. This assumption is incorrect,
and the RDD closure should handle it, or use what I proposed in this PR: the
retry strategy.
I would disagree with this - this is an artifact of an implementation detail of the Spark shuffle - and is not the expected behavior for an MR-based system.
Unfortunately, this has been the behavior since the beginning IMO (at least since 0.6).
IMO this was not a conscious design choice, but rather an oversight.
> IIRC @mridulm didn't agree with it. One problem is that, it's hard for
users to realize that Spark returns wrong result, so they don't know when to
handle it.
Actually I would expect users to end up doing one of those two - the perf characteristics and lack of predictability in SLA after this patch are going to force users to choose one of the two.