[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806954#comment-17806954
 ] 

Nicholas Chammas commented on SPARK-45599:
------------------------------------------

Using [Hypothesis|https://github.com/HypothesisWorks/hypothesis], I've managed 
to shrink the provided test case from 373 elements down to 14:

{code:python}
from math import nan
from pyspark.sql import SparkSession

HYPOTHESIS_EXAMPLE = [
    (0.0,),
    (2.0,),
    (153.0,),
    (168.0,),
    (3252411229536261.0,),
    (7.205759403792794e+16,),
    (1.7976931348623157e+308,),
    (0.25,),
    (nan,),
    (nan,),
    (-0.0,),
    (-128.0,),
    (nan,),
    (nan,),
]

spark = (
    SparkSession.builder
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)


def compare_percentiles(data, slices):
    rdd = spark.sparkContext.parallelize(data, numSlices=1)
    df = spark.createDataFrame(rdd, "val double")
    result1 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

    rdd = spark.sparkContext.parallelize(data, numSlices=slices)
    df = spark.createDataFrame(rdd, "val double")
    result2 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

    assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
    compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
{code}

Running this test fails as follows:

{code:python}
Traceback (most recent call last):                                              
  File ".../SPARK-45599.py", line 41, in <module>
    compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
  File ".../SPARK-45599.py", line 37, in compare_percentiles
    assert result1 == result2, f"{result1}, {result2}"
           ^^^^^^^^^^^^^^^^^^
AssertionError: 0.050000000000000044, -0.0
{code}
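
A small pure-Python reference calculation can help judge which of the two results is plausible. The sketch below is not Spark's implementation. It assumes that Spark's exact percentile ignores nulls, orders NaN after every other value, and interpolates linearly at position p * (n - 1). The helper name reference_percentile is hypothetical.

```python
import math
from math import nan

# Same 14 values as HYPOTHESIS_EXAMPLE, without the single-element row tuples.
VALUES = [
    0.0, 2.0, 153.0, 168.0, 3252411229536261.0, 7.205759403792794e+16,
    1.7976931348623157e+308, 0.25, nan, nan, -0.0, -128.0, nan, nan,
]


def reference_percentile(values, p):
    """Linear-interpolation percentile (assumed to mirror Spark's semantics)."""
    # Drop nulls, then sort with NaN last (an assumption about Spark's ordering).
    data = sorted(
        (v for v in values if v is not None),
        key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v),
    )
    if not data:
        return None
    position = (len(data) - 1) * p
    lower, higher = math.floor(position), math.ceil(position)
    if data[lower] == data[higher]:
        return data[lower]
    return (higher - position) * data[lower] + (position - lower) * data[higher]


# All 14 rows counted: sorted positions 1 and 2 hold the two zeros.
all_rows = reference_percentile(VALUES, 0.1)

# One zero dropped, as if the counts for 0.0 and -0.0 had been collapsed.
without_neg_zero = [
    v for v in VALUES if not (v == 0.0 and math.copysign(1.0, v) < 0)
]
one_zero_lost = reference_percentile(without_neg_zero, 0.1)
```

Under these assumptions, the full 14-row input gives zero, while the 13-row input gives roughly 0.05. That suggests the 0.050000000000000044 result in the assertion is the one that lost a row.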

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-45599
>                 URL: https://issues.apache.org/jira/browse/SPARK-45599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>              Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug, because everything has to go 
> just wrong to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,)]
> df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, 
> numSlices=4), StructType([StructField('val',DoubleType(),True)]))
> df.selectExpr('percentile(val, 0.1)').show(truncate=False)
> {code}
> You will get back a result of {{-5.924228780007003E136}}, but the correct 
> answer is {{-4.739483957565084E136}}, which can be verified by changing the 
> number of slices used to import the data into Spark.
>  
> What is happening is that we are getting super unlucky. In the 4th input 
> partition, the data is read in from 7.849390806334983E226 to the end. This 
> works fine, and we get an OpenHashMap with entries for both 0.0 and -0.0:
>  
> {code:java}
>  OpenHashMap((1.0075153091760986E-236,1), (0.0,1), 
> (-2.646144697462316E-35,1), (-7.951172614280046E-74,1), 
> (-3.468683249247593E-196,1), (5.741383583382542E-14,1), 
> (-2.664533698035492E203,1), (7.227411393136537E206,1), 
> (-3.588518231990927E12,1), (1.9042708096454302E195,1), 
> (-1.9569489404314425E128,1), (7.849390806334983E226,1), 
> (2.187766760184779E306,1), (2.4925669515657655E165,1), 
> (-5.620339412794757E-251,1), (-0.0,1), (-1.1041009815645263E203,1), 
> (-2.3016388448634844E-155,1), (2.1973064836362255E-159,1), 
> (-1.831402251805194E65,1), (1.4403274475565667E41,1), 
> (-3.085825028509117E74,1), (-6.542279108216022E206,1), 
> (-9.871721037428167E119,1), (8.475809563703283E-64,1), 
> (-5.673977270111637E-84,1), (5.120795466142678E-215,1), 
> (-5.046677974902737E132,1), (-4.5154178008513483E-122,1), 
> (1.9846569552093057E-137,1), (-3.3885098786542755E-128,1), 
> (7.679268835670585E223,1), (4.920675036053339E9,1), (-1.0,1), 
> (-4.585039307326895E166,1), (-9.01991342808203E282,1), 
> (5.4470705929955455E-86,1), (9.247723870123388E-295,1), 
> (-1.8891559842111865E63,1), (-4.696878567317712E-162,1), 
> (-1.4882040107841963E286,1), (-5.936844510098297E-82,1), 
> (6.226527569272003E152,1), (-1.1961155424160076E102,1), 
> (-1.6663254121185628E-256,1), (4.4501477170144023E-308,1), 
> (-9.607772864590422E217,1), (-3.010452936419635E-233,1), 
> (4.051866849943636E-254,1), (1.4309793775440402E-87,1), 
> (2.5212410617263588E-282,1), (3.4543959813437507E-304,1), 
> (0.028096279323357867,1), (-7.590734560275502E-63,1), 
> (5.211702553315461E-259,1), (-1.0186016078084965E-281,1), 
> (3.437191836077251E209,1), (NaN,1), (NaN,1), (NaN,1), 
> (8.391630779050713E-135,1), (-5.490780063080251E-9,1), 
> (-2.9383643865423363E-103,1), (2.0738138203216883E201,1), 
> (1.8461539468514548E-225,1), (1.822129180806602E-245,1), (NaN,1), 
> (4.205809391029644E137,1), (2.037360925124577E292,1), 
> (-2.428999624265911E-293,1), (3.002803065141241E-139,1), 
> (6.3131466321042515E153,1), (3.5103766991437114E-60,1), 
> (3.217759099462207E108,1), (-8.796717685143486E203,1), 
> (-2.2385155698231885E285,1), (2.176024662699802E-210,1), 
> (1.703824427218836E-55,1), (9.376528689861087E117,1), 
> (1.7976931348623157E308,2), (NaN,1), (5.891823952773268E98,1), 
> (-2.1696969883753554E-292,1), (4.3214483342777574E-117,1), (Infinity,2), 
> (1.779652973678931E173,1), (-5.234708055733116E12,1), 
> (-5.682293414619055E46,1))
> {code}
> But when we deserialize the map after the shuffle, we get a different 
> result, with the entry for -0.0 gone. That is because the deserialize code 
> uses update, which overwrites the count if the keys are the same. Normally 
> -0.0 and 0.0 are not treated as the same key, unless they happen to hash to 
> the same position when they are being added. In that case they end up being 
> equal to each other, the count for 0.0 is replaced with the count for -0.0, 
> and we lose one row in the data.
>  
> Because the keys are stored in OpenHashSet, that is where the bug actually is:
> [https://github.com/apache/spark/blob/7e82e1bc43e0297c3036d802b3a151d2b93db2f6/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L139-L159]
>  
> I see a few ways to fix this.
>  # Update OpenHashSet/OpenHashMap to do the right thing for floats and 
> doubles around -0.0 and 0.0.
>  # Normalize NaNs and zeros before doing percentiles.
>  # Reinterpret the bits for float/double as an int/long before putting them 
> into the map, and do the reverse when we pull them out. That would also have 
> the advantage of making NaN == NaN, which would reduce the size of the map in 
> those cases for percentile.
>  # Update the deserialize code for percentile to not call update, but instead 
> to call {{counts.changeValue(key, count, _ + count)}}.
>  
> I am not sure if something similar can happen in other places, but I know 
> that for hash aggregate, etc., we normalize the floating-point values because 
> of things like this.
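
To make the -0.0 versus 0.0 distinction concrete, here is a small Python sketch (not Spark code). It shows that the two zeros compare equal while their IEEE 754 bit patterns differ, and then sketches the normalization idea from the second fix option in the list. The helpers double_bits and normalize_double are hypothetical names used only for illustration.

```python
import math
import struct


def double_bits(x):
    """Return the IEEE 754 bit pattern of a double as a 16-digit hex string."""
    return struct.pack(">d", x).hex()


def normalize_double(x):
    """Map -0.0 to 0.0 and every NaN to one canonical NaN; keep None as-is."""
    if x is None:
        return None
    if math.isnan(x):
        return math.nan
    if x == 0.0:
        # True for both 0.0 and -0.0, so both become positive zero.
        return 0.0
    return x


# The two zeros compare equal, yet their bit patterns differ in the sign bit.
zeros_equal = (0.0 == -0.0)
zero_bits = (double_bits(0.0), double_bits(-0.0))

# After normalization, both zeros share one bit pattern.
normalized = [normalize_double(v) for v in (0.0, -0.0, None, 1.5)]
normalized_bits = [double_bits(v) for v in normalized if v is not None]
```

A structure that hashes on the bit pattern but compares keys with primitive equality can disagree with itself about whether these two values are one key or two, which is the kind of mismatch the description attributes to OpenHashSet. Normalizing before insertion sidesteps the question.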


