Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21698
Thinking about the sorting idea again: I don't think it works unless
you sort both the keys and the values themselves.
For instance, let's say we have a groupBy key that generates:
B = k1, { (a11, a12, a13), (a21, a22, a23), (a11, a12, a13) }
Then if we do a distinct on it, you could get:
C = ((a11, a12, a13), (a21, a22, a23))
But say you get fetch failures and you refetch the data. Since the reducer
can fetch the blocks in any order during the shuffle, B could come back as
B = k1, { (a21, a22, a23), (a11, a12, a13), (a11, a12, a13) }
and after the distinct you end up with
C = ((a21, a22, a23), (a11, a12, a13))
If you were to just sort by key, the elements of C could still end up in a
different order, and if you then round-robin the output, the rows of C in the
example above would go to different partitions on the retry.
I'm not sure of a way to get around this without using the hash partitioner,
but I'm still looking into it. I also need to look at the details of how the
DataFrame PR solved this, as I'm curious how it handles the above
example.
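To make the failure mode concrete, here is a minimal sketch (plain Python, not Spark APIs; `distinct_preserving_order` and `round_robin` are hypothetical stand-ins for the distinct and round-robin repartition steps) showing that a distinct whose output order depends on fetch order sends the same rows to different partitions after a refetch:

```python
def distinct_preserving_order(rows):
    # distinct keeps the first occurrence, so output order depends on input order
    return list(dict.fromkeys(rows))

def round_robin(rows, num_partitions):
    # assign row i to partition i % num_partitions
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

# First fetch order for key k1
b1 = [("a11", "a12", "a13"), ("a21", "a22", "a23"), ("a11", "a12", "a13")]
# After a fetch failure, the same blocks can arrive in a different order
b2 = [("a21", "a22", "a23"), ("a11", "a12", "a13"), ("a11", "a12", "a13")]

c1 = distinct_preserving_order(b1)  # a11-row first
c2 = distinct_preserving_order(b2)  # a21-row first

# Same set of distinct rows, but round robin places them in different partitions
p1 = round_robin(c1, 2)
p2 = round_robin(c2, 2)
```

Here `set(c1) == set(c2)` but `p1 != p2`: the retry produces the same logical result in a different order, so the round-robin assignment is non-deterministic across attempts.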