TL;DR - We already had the fix from SPARK-5984. The delta from the current JDK implementation to Spark's looks actually inconsequential. No action required AFAICT.
On Fri, Aug 31, 2018 at 12:30 PM Sean Owen <sro...@gmail.com> wrote: > I looked into this, because it sure sounds like a similar issue from a few > years ago that was fixed in > https://issues.apache.org/jira/browse/SPARK-5984 > The change in that JIRA actually looks almost identical to the change > mentioned in the JDK bug: > http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239 > > Reading the paper > http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in > section 5 a little more, I think they are saying that there were two ways > to fix the original problem: a) fix the invariant, or b) increase some data > structure size. Java did the latter, it seems, and now they've shown it's > actually still busted. However Python and the original paper did the > former, and we seem to have copied that fix from > http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ > My > understanding is that this still works, and is what Java *now* implements. > > The only difference I can see in implementation is an extra check for a > negative array index before dereferencing array[n]. We can add that for > full consistency with the JVM change, I suppose. I don't think it's > relevant to the problem reported in the paper, but could be an issue > otherwise. The JVM implementation suggests it thinks this needs to be > guarded. > > I did hack together a crude translation of the paper's bug reproduction at > http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark > test code. It does need a huge amount of memory to run (>32g.. ended up at > 44g) so not so feasible to put in the test suite. Running it on Spark > master nets a JVM crash: > > # Problematic frame: > # J 10195 C2 > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I > (198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342] > > Thats... not good, but I can't tell if it's really due to this same issue > or something else going on in the off-heap code. Making the tiny change I > mentioned above doesn't do anything. > > On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <r...@databricks.com> wrote: > >> “As a byproduct of our study, we uncover a bug in the Java >> implementation that can cause the sorting method to fail during the >> execution.” >> >> http://drops.dagstuhl.de/opus/volltexte/2018/9467/ >> >> This might impact Spark since we took the Java based TimSort >> implementation. I have seen in the wild TimSort failing in the past. Maybe >> this is the cause. >> >