I'm inclined to agree. Just saying that it is not a regression doesn't
really cut it when it is a now-known data correctness issue. We need
to do a lot more than nothing before releasing 2.4.0. At a bare
minimum, that has to be much more complete and publicly highlighted
documentation of the issue so that users are less likely to stumble into
this unaware; but really we need to fix at least the most common cases of
this bug. Backports to maintenance branches are also probably in order.

On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid <iras...@cloudera.com.invalid>
wrote:

> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
>> Shuffle+Repartition on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, and there is no consensus
>> yet about what the right fix is. It is likely to miss Spark 2.4 because
>> it's a long-standing issue, not a regression.
>>
>
> This is a really serious data loss bug. Yes, it's very complex, but we
> absolutely have to fix this; I really think it should be in 2.4.
> Has work on it stopped?
>
