Robert Joseph Evans created SPARK-45599:
-------------------------------------------

             Summary: Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
                 Key: SPARK-45599
                 URL: https://issues.apache.org/jira/browse/SPARK-45599
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0, 3.2.3, 3.3.0
            Reporter: Robert Joseph Evans


I think this actually impacts all versions that have ever supported percentile 
and it may impact other things because the bug is in OpenHashMap.

 

I am really surprised that we caught this bug, because everything has to line 
up just wrong to make it happen. In python/pyspark, if you run

 
{code:python}
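# `spark` is the SparkSession that the pyspark shell provides.
from math import inf, nan
from pyspark import SparkContext
from pyspark.sql.types import DoubleType, StructField, StructType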

data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
(5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
(-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
(2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
(-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
(1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
(-5.682293414619055e+46,), (-4.585039307326895e+166,), 
(-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
(None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
(-5.046677974902737e+132,), (-5.490780063080251e-09,), 
(1.703824427218836e-55,), (-1.1961155424160076e+102,), 
(1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
(5.120795466142678e-215,), (-9.01991342808203e+282,), 
(4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
(3.4543959813437507e-304,), (-7.590734560275502e-63,), 
(9.376528689861087e+117,), (-2.1696969883753554e-292,), 
(7.227411393136537e+206,), (-2.428999624265911e-293,), 
(5.741383583382542e-14,), (-1.4882040107841963e+286,), 
(2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), 
(3.002803065141241e-139,), (-1.1041009815645263e+203,), 
(1.8461539468514548e-225,), (-5.620339412794757e-251,), 
(3.5103766991437114e-60,), (2.4925669515657655e+165,), 
(3.217759099462207e+108,), (-8.796717685143486e+203,), 
(2.037360925124577e+292,), (-6.542279108216022e+206,), 
(-7.951172614280046e-74,), (6.226527569272003e+152,), 
(-5.673977270111637e-84,), (-1.0186016078084965e-281,), 
(1.7976931348623157e+308,), (4.205809391029644e+137,), 
(-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), 
(1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), 
(4.3214483342777574e-117,), (-7.973642629411105e-89,), 
(-1.1028137694801181e-297,), (2.9000325280299273e-39,), 
(-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,), 
(7.849390806334983e+226,), (-1.831402251805194e+65,), 
(-2.664533698035492e+203,), (-2.2385155698231885e+285,), 
(-2.3016388448634844e-155,), (-9.607772864590422e+217,), 
(3.437191836077251e+209,), (1.9846569552093057e-137,), 
(-3.010452936419635e-233,), (1.4309793775440402e-87,), 
(-2.9383643865423363e-103,), (-4.696878567317712e-162,), 
(8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,), 
(-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,), 
(7.679268835670585e+223,), (6.3131466321042515e+153,), 
(1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), 
(inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
(2.5212410617263588e-282,), (-2.646144697462316e-35,), 
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), 
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), 
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), 
(4920675036.053339,), (None,), (4.4501477170144023e-308,), 
(2.176024662699802e-210,), (-5.046677974902737e+132,), 
(-5.490780063080251e-09,), (1.703824427218836e-55,), 
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), 
(5.4470705929955455e-86,), (5.120795466142678e-215,), 
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), 
(-1.8891559842111865e+63,), (3.4543959813437507e-304,), 
(-7.590734560275502e-63,), (9.376528689861087e+117,), 
(-2.1696969883753554e-292,), (7.227411393136537e+206,), 
(-2.428999624265911e-293,), (5.741383583382542e-14,), 
(-1.4882040107841963e+286,), (2.1973064836362255e-159,), 
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), 
(-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
(-5.620339412794757e-251,), (3.5103766991437114e-60,), 
(2.4925669515657655e+165,), (3.217759099462207e+108,), 
(-8.796717685143486e+203,), (2.037360925124577e+292,), 
(-6.542279108216022e+206,), (-7.951172614280046e-74,), 
(6.226527569272003e+152,), (-5.673977270111637e-84,), 
(-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
(4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
(-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
(1.7976931348623157e+308,), (4.3214483342777574e-117,), 
(-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
(2.9000325280299273e-39,), (-1.077534929323113e-264,), 
(-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
(-1.831402251805194e+65,), (-2.664533698035492e+203,), 
(-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
(-9.607772864590422e+217,), (3.437191836077251e+209,), 
(1.9846569552093057e-137,), (-3.010452936419635e-233,), 
(1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
(-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
(-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
(2.187766760184779e+306,), (7.679268835670585e+223,), 
(6.3131466321042515e+153,), (1.779652973678931e+173,), 
(9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
(1.9042708096454302e+195,), (-3.085825028509117e+74,), 
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
(2.5212410617263588e-282,), (-2.646144697462316e-35,), 
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), 
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), 
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), 
(4920675036.053339,), (None,), (4.4501477170144023e-308,), 
(2.176024662699802e-210,), (-5.046677974902737e+132,), 
(-5.490780063080251e-09,), (1.703824427218836e-55,), 
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), 
(5.4470705929955455e-86,), (5.120795466142678e-215,), 
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), 
(-1.8891559842111865e+63,), (3.4543959813437507e-304,), 
(-7.590734560275502e-63,), (9.376528689861087e+117,), 
(-2.1696969883753554e-292,), (7.227411393136537e+206,), 
(-2.428999624265911e-293,), (5.741383583382542e-14,), 
(-1.4882040107841963e+286,), (2.1973064836362255e-159,), 
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), 
(-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
(-5.620339412794757e-251,), (3.5103766991437114e-60,), 
(2.4925669515657655e+165,), (3.217759099462207e+108,), 
(-8.796717685143486e+203,), (2.037360925124577e+292,), 
(-6.542279108216022e+206,), (-7.951172614280046e-74,), 
(6.226527569272003e+152,), (-5.673977270111637e-84,), 
(-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
(4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
(-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
(1.7976931348623157e+308,), (4.3214483342777574e-117,), 
(-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
(2.9000325280299273e-39,), (-1.077534929323113e-264,), 
(-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
(-1.831402251805194e+65,), (-2.664533698035492e+203,), 
(-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
(-9.607772864590422e+217,), (3.437191836077251e+209,), 
(1.9846569552093057e-137,), (-3.010452936419635e-233,), 
(1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
(-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
(-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
(2.187766760184779e+306,), (7.679268835670585e+223,), 
(6.3131466321042515e+153,), (1.779652973678931e+173,), 
(9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
(1.9042708096454302e+195,), (-3.085825028509117e+74,), 
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
(2.5212410617263588e-282,), (-2.646144697462316e-35,), 
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), 
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), 
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), 
(4920675036.053339,), (None,), (4.4501477170144023e-308,), 
(2.176024662699802e-210,), (-5.046677974902737e+132,), 
(-5.490780063080251e-09,), (1.703824427218836e-55,), 
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), 
(5.4470705929955455e-86,), (5.120795466142678e-215,), 
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), 
(-1.8891559842111865e+63,), (3.4543959813437507e-304,), 
(-7.590734560275502e-63,), (9.376528689861087e+117,), 
(-2.1696969883753554e-292,), (7.227411393136537e+206,), 
(-2.428999624265911e-293,), (5.741383583382542e-14,), 
(-1.4882040107841963e+286,), (2.1973064836362255e-159,), 
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), 
(-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
(-5.620339412794757e-251,), (3.5103766991437114e-60,), 
(2.4925669515657655e+165,), (3.217759099462207e+108,), 
(-8.796717685143486e+203,), (2.037360925124577e+292,), 
(-6.542279108216022e+206,), (-7.951172614280046e-74,), 
(6.226527569272003e+152,), (-5.673977270111637e-84,), 
(-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
(4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
(-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
(1.7976931348623157e+308,), (4.3214483342777574e-117,)]

df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, 
numSlices=4), StructType([StructField('val',DoubleType(),True)]))

df.selectExpr('percentile(val, 0.1)').show(truncate=False){code}
You will get back a result of {{-5.924228780007003E136}}, but the correct answer 
is {{-4.739483957565084E136}}, which can be verified by changing the number of 
slices used to import the data into Spark.

 

What is happening is that we are getting very unlucky. In the 4th input 
partition the data is read in from 7.849390806334983E226 to the end. This works 
fine and we get an OpenHashMap with an entry for both 0.0 and -0.0:

 
{code:java}
 OpenHashMap((1.0075153091760986E-236,1), (0.0,1), (-2.646144697462316E-35,1), 
(-7.951172614280046E-74,1), (-3.468683249247593E-196,1), 
(5.741383583382542E-14,1), (-2.664533698035492E203,1), 
(7.227411393136537E206,1), (-3.588518231990927E12,1), 
(1.9042708096454302E195,1), (-1.9569489404314425E128,1), 
(7.849390806334983E226,1), (2.187766760184779E306,1), 
(2.4925669515657655E165,1), (-5.620339412794757E-251,1), (-0.0,1), 
(-1.1041009815645263E203,1), (-2.3016388448634844E-155,1), 
(2.1973064836362255E-159,1), (-1.831402251805194E65,1), 
(1.4403274475565667E41,1), (-3.085825028509117E74,1), 
(-6.542279108216022E206,1), (-9.871721037428167E119,1), 
(8.475809563703283E-64,1), (-5.673977270111637E-84,1), 
(5.120795466142678E-215,1), (-5.046677974902737E132,1), 
(-4.5154178008513483E-122,1), (1.9846569552093057E-137,1), 
(-3.3885098786542755E-128,1), (7.679268835670585E223,1), 
(4.920675036053339E9,1), (-1.0,1), (-4.585039307326895E166,1), 
(-9.01991342808203E282,1), (5.4470705929955455E-86,1), 
(9.247723870123388E-295,1), (-1.8891559842111865E63,1), 
(-4.696878567317712E-162,1), (-1.4882040107841963E286,1), 
(-5.936844510098297E-82,1), (6.226527569272003E152,1), 
(-1.1961155424160076E102,1), (-1.6663254121185628E-256,1), 
(4.4501477170144023E-308,1), (-9.607772864590422E217,1), 
(-3.010452936419635E-233,1), (4.051866849943636E-254,1), 
(1.4309793775440402E-87,1), (2.5212410617263588E-282,1), 
(3.4543959813437507E-304,1), (0.028096279323357867,1), 
(-7.590734560275502E-63,1), (5.211702553315461E-259,1), 
(-1.0186016078084965E-281,1), (3.437191836077251E209,1), (NaN,1), (NaN,1), 
(NaN,1), (8.391630779050713E-135,1), (-5.490780063080251E-9,1), 
(-2.9383643865423363E-103,1), (2.0738138203216883E201,1), 
(1.8461539468514548E-225,1), (1.822129180806602E-245,1), (NaN,1), 
(4.205809391029644E137,1), (2.037360925124577E292,1), 
(-2.428999624265911E-293,1), (3.002803065141241E-139,1), 
(6.3131466321042515E153,1), (3.5103766991437114E-60,1), 
(3.217759099462207E108,1), (-8.796717685143486E203,1), 
(-2.2385155698231885E285,1), (2.176024662699802E-210,1), 
(1.703824427218836E-55,1), (9.376528689861087E117,1), 
(1.7976931348623157E308,2), (NaN,1), (5.891823952773268E98,1), 
(-2.1696969883753554E-292,1), (4.3214483342777574E-117,1), (Infinity,2), 
(1.779652973678931E173,1), (-5.234708055733116E12,1), (-5.682293414619055E46,1))
{code}
But when we deserialize the map after the shuffle, we get a different result, 
with the entry for -0.0 gone. That is because the deserialize code uses update, 
which overwrites the count if the key is already present. Normally -0.0 and 0.0 
are treated as different keys, unless they happen to hash to the same position 
when they are being added. In that case they compare equal to each other, the 
count for 0.0 is replaced with the count for -0.0, and we lose one row of the 
data.
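
The overwrite described above can be sketched with a toy map. This is a minimal illustration, not Spark's OpenHashMap: {{ToyOpenMap}}, {{double_bits}}, {{load}} and both hash functions are hypothetical names made up for this example. The only property it is meant to share with the real code is that the slot is chosen by a hash while keys are compared with {{==}}, under which -0.0 and 0.0 are equal.

```python
import struct


def double_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


class ToyOpenMap:
    """Toy linear-probing map (hypothetical, not Spark's OpenHashMap).

    Slots are chosen by a caller-supplied hash, but keys are compared
    with ==, under which -0.0 and 0.0 are equal.
    """

    def __init__(self, hash_fn, capacity=8):
        self.hash_fn = hash_fn
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def update(self, key, value):
        pos = self.hash_fn(key) % len(self.keys)
        # Probe until we find an empty slot or a key that compares equal.
        while self.keys[pos] is not None and self.keys[pos] != key:
            pos = (pos + 1) % len(self.keys)
        if self.keys[pos] is None:
            self.keys[pos] = key
        self.values[pos] = value  # overwrites any existing count

    def items(self):
        return [(k, v) for k, v in zip(self.keys, self.values) if k is not None]


def load(hash_fn):
    """Insert one row for 0.0 and one row for -0.0, as in deserialization."""
    m = ToyOpenMap(hash_fn)
    m.update(0.0, 1)
    m.update(-0.0, 1)
    return m.items()


# Hash on the sign bit: the two zeros land in different slots and both survive.
separate = load(lambda k: double_bits(k) >> 63)
# Hash on the low bits: both zeros land in slot 0 and are treated as one key.
collided = load(lambda k: double_bits(k) & 0x7)

print(separate)  # two entries, total count 2
print(collided)  # one entry, total count 1: a row has been lost
```

With the first hash the total count is 2, with the second it is 1, even though the same two rows were inserted. Whether the real map loses a row depends in the same way on where the two zeros happen to land.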

 

Because the keys are stored in an OpenHashSet, that is where the bug actually is. 
[https://github.com/apache/spark/blob/7e82e1bc43e0297c3036d802b3a151d2b93db2f6/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L139-L159]

 

I see a few ways to fix this.
 # Update OpenHashSet/OpenHashMap to do the right thing for floats and doubles 
around -0.0 and 0.0.
 # Normalize NaNs and zeros before doing percentiles.
 # Reinterpret the bits for float/double as an int/long before putting them 
into the map and do the reverse when we pull them out. That would also have 
the advantage of making NaN == NaN, which would reduce the size of the map in 
those cases for percentile.
 # Update the deserialize code for percentile to not call update, but instead 
to call {{counts.changeValue(key, count, _ + count)}}.
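
Options 2 to 4 can be sketched in plain Python to show what each one changes about the keys. This is an illustration under assumptions, not Spark code: {{normalize}}, {{to_bits}}, {{from_bits}} and {{count_values}} are hypothetical helpers, and a Python dict stands in for the map.

```python
import math
import struct


def normalize(x: float) -> float:
    """Option 2: collapse -0.0 to 0.0 and every NaN to one canonical NaN."""
    if math.isnan(x):
        return math.nan
    if x == 0.0:  # true for both 0.0 and -0.0
        return 0.0
    return x


def to_bits(x: float) -> int:
    """Option 3: reinterpret a double as a signed 64-bit integer key."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def from_bits(bits: int) -> float:
    """Inverse of to_bits, used when reading keys back out of the map."""
    return struct.unpack("<d", struct.pack("<q", bits))[0]


def count_values(values, key_fn):
    """Option 4 in spirit: add to the existing count instead of overwriting."""
    counts = {}
    for v in values:
        k = key_fn(v)
        counts[k] = counts.get(k, 0) + 1
    return counts


values = [0.0, -0.0, math.nan, math.nan]

# Bit keys alone: 0.0 and -0.0 stay distinct, identical NaN patterns merge.
by_bits = count_values(values, to_bits)
# Normalizing first: the two zeros share a key as well.
by_normalized_bits = count_values(values, lambda v: to_bits(normalize(v)))

print(len(by_bits), sum(by_bits.values()))
print(len(by_normalized_bits), sum(by_normalized_bits.values()))
```

In both variants the total count equals the number of input rows, which is the property the percentile computation needs. The variants differ only in whether -0.0 and 0.0 end up as one key or two.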

I am not sure if something similar can happen in other places, but I know that 
for hash aggregate, etc., we normalize the floating point values because of 
things like this.


