TL;DR - We already had the fix from SPARK-5984. The delta from the current
JDK implementation to Spark's looks actually inconsequential. No action
required AFAICT.

On Fri, Aug 31, 2018 at 12:30 PM Sean Owen <sro...@gmail.com> wrote:

> I looked into this, because it sure sounds like a similar issue from a few
> years ago that was fixed in
> https://issues.apache.org/jira/browse/SPARK-5984
> The change in that JIRA actually looks almost identical to the change
> mentioned in the JDK bug:
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239
>
> Reading the paper
> http://drops.dagstuhl.de/opus/volltexte/2018/9467/pdf/LIPIcs-ESA-2018-4.pdf in
> section 5 a little more, I think they are saying that there were two ways
> to fix the original problem: a) fix the invariant, or b) increase some data
> structure size. Java did the latter, it seems, and now they've shown it's
> actually still busted. However Python and the original paper did the
> former, and we seem to have copied that fix from
> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
>  My
> understanding is that this still works, and is what Java *now* implements.
>
> The only difference I can see in implementation is an extra check for a
> negative array index before dereferencing array[n]. We can add that for
> full consistency with the JVM change, I suppose. I don't think it's
> relevant to the problem reported in the paper, but could be an issue
> otherwise. The JVM implementation suggests it thinks this needs to be
> guarded.
>
> I did hack together a crude translation of the paper's bug reproduction at
> http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java by copying some Spark
> test code. It does need a huge amount of memory to run (>32g.. ended up at
> 44g) so not so feasible to put in the test suite. Running it on Spark
> master nets a JVM crash:
>
> # Problematic frame:
> # J 10195 C2
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(Ljava/lang/Object;IILjava/util/Comparator;)I
> (198 bytes) @ 0x00007ff64dd9a262 [0x00007ff64dd99f20+0x342]
>
> Thats... not good, but I can't tell if it's really due to this same issue
> or something else going on in the off-heap code. Making the tiny change I
> mentioned above doesn't do anything.
>
> On Fri, Aug 31, 2018 at 2:37 AM Reynold Xin <r...@databricks.com> wrote:
>
>> “As a byproduct of our study, we uncover a bug in the Java
>> implementation that can cause the sorting method to fail during the
>> execution.”
>>
>> http://drops.dagstuhl.de/opus/volltexte/2018/9467/
>>
>> This might impact Spark since we took the Java based TimSort
>> implementation. I have seen in the wild TimSort failing in the past. Maybe
>> this is the cause.
>>
>

Reply via email to