Robert Joseph Evans created SPARK-45599:
-------------------------------------------
Summary: Percentile can produce a wrong answer if -0.0 and 0.0 are
mixed in the dataset
Key: SPARK-45599
URL: https://issues.apache.org/jira/browse/SPARK-45599
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0, 3.2.3, 3.3.0
Reporter: Robert Joseph Evans
I think this actually impacts all versions that have ever supported percentile
and it may impact other things because the bug is in OpenHashMap.
I am really surprised that we caught this bug because everything has to hit
just wrong to make it happen. in python/pyspark if you run
{code:python}
from math import *
from pyspark.sql.types import *
data = [(1.779652973678931e+173,), (9.247723870123388e-295,),
(5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,),
(-3.085825028509117e+74,), (-1.9569489404314425e+128,),
(2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,),
(-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,),
(1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,),
(-5.682293414619055e+46,), (-4.585039307326895e+166,),
(-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,),
(None,), (4.4501477170144023e-308,), (2.176024662699802e-210,),
(-5.046677974902737e+132,), (-5.490780063080251e-09,),
(1.703824427218836e-55,), (-1.1961155424160076e+102,),
(1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,),
(5.120795466142678e-215,), (-9.01991342808203e+282,),
(4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,),
(3.4543959813437507e-304,), (-7.590734560275502e-63,),
(9.376528689861087e+117,), (-2.1696969883753554e-292,),
(7.227411393136537e+206,), (-2.428999624265911e-293,),
(5.741383583382542e-14,), (-1.4882040107841963e+286,),
(2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,),
(3.002803065141241e-139,), (-1.1041009815645263e+203,),
(1.8461539468514548e-225,), (-5.620339412794757e-251,),
(3.5103766991437114e-60,), (2.4925669515657655e+165,),
(3.217759099462207e+108,), (-8.796717685143486e+203,),
(2.037360925124577e+292,), (-6.542279108216022e+206,),
(-7.951172614280046e-74,), (6.226527569272003e+152,),
(-5.673977270111637e-84,), (-1.0186016078084965e-281,),
(1.7976931348623157e+308,), (4.205809391029644e+137,),
(-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,),
(1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,),
(4.3214483342777574e-117,), (-7.973642629411105e-89,),
(-1.1028137694801181e-297,), (2.9000325280299273e-39,),
(-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,),
(7.849390806334983e+226,), (-1.831402251805194e+65,),
(-2.664533698035492e+203,), (-2.2385155698231885e+285,),
(-2.3016388448634844e-155,), (-9.607772864590422e+217,),
(3.437191836077251e+209,), (1.9846569552093057e-137,),
(-3.010452936419635e-233,), (1.4309793775440402e-87,),
(-2.9383643865423363e-103,), (-4.696878567317712e-162,),
(8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,),
(-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,),
(7.679268835670585e+223,), (6.3131466321042515e+153,),
(1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,),
(inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,),
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,),
(2.5212410617263588e-282,), (-2.646144697462316e-35,),
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,),
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,),
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,),
(4920675036.053339,), (None,), (4.4501477170144023e-308,),
(2.176024662699802e-210,), (-5.046677974902737e+132,),
(-5.490780063080251e-09,), (1.703824427218836e-55,),
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,),
(5.4470705929955455e-86,), (5.120795466142678e-215,),
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,),
(-1.8891559842111865e+63,), (3.4543959813437507e-304,),
(-7.590734560275502e-63,), (9.376528689861087e+117,),
(-2.1696969883753554e-292,), (7.227411393136537e+206,),
(-2.428999624265911e-293,), (5.741383583382542e-14,),
(-1.4882040107841963e+286,), (2.1973064836362255e-159,),
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,),
(-1.1041009815645263e+203,), (1.8461539468514548e-225,),
(-5.620339412794757e-251,), (3.5103766991437114e-60,),
(2.4925669515657655e+165,), (3.217759099462207e+108,),
(-8.796717685143486e+203,), (2.037360925124577e+292,),
(-6.542279108216022e+206,), (-7.951172614280046e-74,),
(6.226527569272003e+152,), (-5.673977270111637e-84,),
(-1.0186016078084965e-281,), (1.7976931348623157e+308,),
(4.205809391029644e+137,), (-9.871721037428167e+119,), (None,),
(-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,),
(1.7976931348623157e+308,), (4.3214483342777574e-117,),
(-7.973642629411105e-89,), (-1.1028137694801181e-297,),
(2.9000325280299273e-39,), (-1.077534929323113e-264,),
(-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,),
(-1.831402251805194e+65,), (-2.664533698035492e+203,),
(-2.2385155698231885e+285,), (-2.3016388448634844e-155,),
(-9.607772864590422e+217,), (3.437191836077251e+209,),
(1.9846569552093057e-137,), (-3.010452936419635e-233,),
(1.4309793775440402e-87,), (-2.9383643865423363e-103,),
(-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,),
(-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,),
(2.187766760184779e+306,), (7.679268835670585e+223,),
(6.3131466321042515e+153,), (1.779652973678931e+173,),
(9.247723870123388e-295,), (5.891823952773268e+98,), (inf,),
(1.9042708096454302e+195,), (-3.085825028509117e+74,),
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,),
(2.5212410617263588e-282,), (-2.646144697462316e-35,),
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,),
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,),
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,),
(4920675036.053339,), (None,), (4.4501477170144023e-308,),
(2.176024662699802e-210,), (-5.046677974902737e+132,),
(-5.490780063080251e-09,), (1.703824427218836e-55,),
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,),
(5.4470705929955455e-86,), (5.120795466142678e-215,),
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,),
(-1.8891559842111865e+63,), (3.4543959813437507e-304,),
(-7.590734560275502e-63,), (9.376528689861087e+117,),
(-2.1696969883753554e-292,), (7.227411393136537e+206,),
(-2.428999624265911e-293,), (5.741383583382542e-14,),
(-1.4882040107841963e+286,), (2.1973064836362255e-159,),
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,),
(-1.1041009815645263e+203,), (1.8461539468514548e-225,),
(-5.620339412794757e-251,), (3.5103766991437114e-60,),
(2.4925669515657655e+165,), (3.217759099462207e+108,),
(-8.796717685143486e+203,), (2.037360925124577e+292,),
(-6.542279108216022e+206,), (-7.951172614280046e-74,),
(6.226527569272003e+152,), (-5.673977270111637e-84,),
(-1.0186016078084965e-281,), (1.7976931348623157e+308,),
(4.205809391029644e+137,), (-9.871721037428167e+119,), (None,),
(-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,),
(1.7976931348623157e+308,), (4.3214483342777574e-117,),
(-7.973642629411105e-89,), (-1.1028137694801181e-297,),
(2.9000325280299273e-39,), (-1.077534929323113e-264,),
(-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,),
(-1.831402251805194e+65,), (-2.664533698035492e+203,),
(-2.2385155698231885e+285,), (-2.3016388448634844e-155,),
(-9.607772864590422e+217,), (3.437191836077251e+209,),
(1.9846569552093057e-137,), (-3.010452936419635e-233,),
(1.4309793775440402e-87,), (-2.9383643865423363e-103,),
(-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,),
(-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,),
(2.187766760184779e+306,), (7.679268835670585e+223,),
(6.3131466321042515e+153,), (1.779652973678931e+173,),
(9.247723870123388e-295,), (5.891823952773268e+98,), (inf,),
(1.9042708096454302e+195,), (-3.085825028509117e+74,),
(-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,),
(2.5212410617263588e-282,), (-2.646144697462316e-35,),
(-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,),
(5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,),
(-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,),
(4920675036.053339,), (None,), (4.4501477170144023e-308,),
(2.176024662699802e-210,), (-5.046677974902737e+132,),
(-5.490780063080251e-09,), (1.703824427218836e-55,),
(-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,),
(5.4470705929955455e-86,), (5.120795466142678e-215,),
(-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,),
(-1.8891559842111865e+63,), (3.4543959813437507e-304,),
(-7.590734560275502e-63,), (9.376528689861087e+117,),
(-2.1696969883753554e-292,), (7.227411393136537e+206,),
(-2.428999624265911e-293,), (5.741383583382542e-14,),
(-1.4882040107841963e+286,), (2.1973064836362255e-159,),
(0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,),
(-1.1041009815645263e+203,), (1.8461539468514548e-225,),
(-5.620339412794757e-251,), (3.5103766991437114e-60,),
(2.4925669515657655e+165,), (3.217759099462207e+108,),
(-8.796717685143486e+203,), (2.037360925124577e+292,),
(-6.542279108216022e+206,), (-7.951172614280046e-74,),
(6.226527569272003e+152,), (-5.673977270111637e-84,),
(-1.0186016078084965e-281,), (1.7976931348623157e+308,),
(4.205809391029644e+137,), (-9.871721037428167e+119,), (None,),
(-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,),
(1.7976931348623157e+308,), (4.3214483342777574e-117,)]
df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data,
numSlices=4), StructType([StructField('val',DoubleType(),True)]))
df.selectExpr('percentile(val, 0.1)').show(truncate=False){code}
You will get back a result of {{-5.924228780007003E136}} but the correct answer
is {{-4.739483957565084E136}} which can be verified by changing the number of
slices used to import the data into spark.
What is happening is that we are getting super unlucky. In the 4th input
partition the data is read in from 7.849390806334983E226 to the end. This works
fine and we get an OpenHashMap with an entry for both 0.0 and -0.0
{code:java}
OpenHashMap((1.0075153091760986E-236,1), (0.0,1), (-2.646144697462316E-35,1),
(-7.951172614280046E-74,1), (-3.468683249247593E-196,1),
(5.741383583382542E-14,1), (-2.664533698035492E203,1),
(7.227411393136537E206,1), (-3.588518231990927E12,1),
(1.9042708096454302E195,1), (-1.9569489404314425E128,1),
(7.849390806334983E226,1), (2.187766760184779E306,1),
(2.4925669515657655E165,1), (-5.620339412794757E-251,1), (-0.0,1),
(-1.1041009815645263E203,1), (-2.3016388448634844E-155,1),
(2.1973064836362255E-159,1), (-1.831402251805194E65,1),
(1.4403274475565667E41,1), (-3.085825028509117E74,1),
(-6.542279108216022E206,1), (-9.871721037428167E119,1),
(8.475809563703283E-64,1), (-5.673977270111637E-84,1),
(5.120795466142678E-215,1), (-5.046677974902737E132,1),
(-4.5154178008513483E-122,1), (1.9846569552093057E-137,1),
(-3.3885098786542755E-128,1), (7.679268835670585E223,1),
(4.920675036053339E9,1), (-1.0,1), (-4.585039307326895E166,1),
(-9.01991342808203E282,1), (5.4470705929955455E-86,1),
(9.247723870123388E-295,1), (-1.8891559842111865E63,1),
(-4.696878567317712E-162,1), (-1.4882040107841963E286,1),
(-5.936844510098297E-82,1), (6.226527569272003E152,1),
(-1.1961155424160076E102,1), (-1.6663254121185628E-256,1),
(4.4501477170144023E-308,1), (-9.607772864590422E217,1),
(-3.010452936419635E-233,1), (4.051866849943636E-254,1),
(1.4309793775440402E-87,1), (2.5212410617263588E-282,1),
(3.4543959813437507E-304,1), (0.028096279323357867,1),
(-7.590734560275502E-63,1), (5.211702553315461E-259,1),
(-1.0186016078084965E-281,1), (3.437191836077251E209,1), (NaN,1), (NaN,1),
(NaN,1), (8.391630779050713E-135,1), (-5.490780063080251E-9,1),
(-2.9383643865423363E-103,1), (2.0738138203216883E201,1),
(1.8461539468514548E-225,1), (1.822129180806602E-245,1), (NaN,1),
(4.205809391029644E137,1), (2.037360925124577E292,1),
(-2.428999624265911E-293,1), (3.002803065141241E-139,1),
(6.3131466321042515E153,1), (3.5103766991437114E-60,1),
(3.217759099462207E108,1), (-8.796717685143486E203,1),
(-2.2385155698231885E285,1), (2.176024662699802E-210,1),
(1.703824427218836E-55,1), (9.376528689861087E117,1),
(1.7976931348623157E308,2), (NaN,1), (5.891823952773268E98,1),
(-2.1696969883753554E-292,1), (4.3214483342777574E-117,1), (Infinity,2),
(1.779652973678931E173,1), (-5.234708055733116E12,1), (-5.682293414619055E46,1))
{code}
But when we go to deserialize the map after the shuffle we get a different
result out with the entry for -0.0 gone. That is because the deserialize code
uses update that will overwrite the count if the keys are the same. But -0.0
and 0.0 are not the same. Unless they happen to hash to the same position when
they are being added in. In that case they end up being equal to each other and
the count for 0.0 is replaced with the count for -0.0 and we lose one row in
the data.
Because the keys are stored in OpenHashSet that is where the bug actually is.
[https://github.com/apache/spark/blob/7e82e1bc43e0297c3036d802b3a151d2b93db2f6/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L139-L159]
I see a few ways to fix this.
# Update OpenHashSet/OpenHashMap to do the right thing for floats and doubles
around -0.0 and 0.0
# normalize nans and zeros before doing percentiles
# reinterpret the bits for float/double as an int/long before putting them
into the map and do the reverse when we pull them out. That would also have
the advantage of making NaN == NaN which would reduce the size of the map in
those cases for percentile.
# Update the deserialize code for percentile to not call update, but instead
to call {{counts.changeValue(key, count, _ + count)}}
I am not sure if something similar can happen in other places, but I know for
hash aggregate/etc we normalize the floating point values because of things
like this.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]