I'm inclined to agree. Just saying that it is not a regression doesn't really cut it when it is now a known data correctness issue. We need to do a lot more than nothing before releasing 2.4.0. At a bare minimum, that means much more complete and publicly highlighted documentation of the issue, so that users are less likely to stumble into this unaware; but really we need to fix at least the most common cases of this bug. Backports to maintenance branches are also probably in order.

On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid <iras...@cloudera.com.invalid> wrote:

> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
>> Shuffle+Repartition on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, there is no consensus about
>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>> long-standing issue, not a regression.
>>
>
> This is a really serious data loss bug. Yes, it's very complex, but we
> absolutely have to fix this; I really think it should be in 2.4.
> Has work on it stopped?
>