[
https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen resolved SPARK-29325.
----------------------------------
Resolution: Duplicate
> approxQuantile() results are incorrect and vary significantly for small
> changes in relativeError
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-29325
> URL: https://issues.apache.org/jira/browse/SPARK-29325
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2, 2.4.4
> Environment: I was using OSX 10.14.6.
> I was using Scala 2.11.12 and Spark 2.4.4.
> I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
> Reporter: James Verbus
> Priority: Major
> Labels: correctness
> Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>
> The [approxQuantile()
> method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40]
> sometimes returns incorrect results that are highly sensitive to the
> choice of relativeError.
> Below is an example in the latest Spark version (2.4.4). The result changes
> significantly for small changes in the specified relativeError parameter,
> far more than the relativeError parameter would lead one to expect.
>
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
> df: org.apache.spark.sql.DataFrame = [value: double]
> scala> df.stat.approxQuantile("value", Array(0.9), 0)
> res0: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
> res1: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
> res2: Array[Double] = Array(0.5926195654486178)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
> res3: Array[Double] = Array(0.5924693999048418)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
> res4: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
> res5: Array[Double] = Array(0.5923925937051544)
> {code}
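
For reference, the approxQuantile() scaladoc states that for N rows, probability p and error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal sketch of how one might check a returned value against that bound in plain Scala, without Spark. The object name, the helper name `withinRankBound`, and the synthetic data are illustrative assumptions and are not part of the report.

```scala
object RankBoundCheck {
  /** True if `x` is an acceptable answer for quantile `p` with error `err`
    * under the documented bound
    *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    * `data` is the whole column; ranks are 1-based positions in sorted order.
    * Hypothetical helper for illustration only. */
  def withinRankBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
    val n = data.length
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    // With ties, x occupies every rank from (count below x) + 1
    // up to (count at or below x).
    val minRank = data.count(_ < x).toLong + 1
    val maxRank = data.count(_ <= x).toLong
    // maxRank < minRank means x does not occur in the data at all.
    maxRank >= minRank && minRank <= hi && maxRank >= lo
  }

  def main(args: Array[String]): Unit = {
    // Synthetic column: 1.0, 2.0, ..., 1000.0 (not the attached data).
    val data = (1 to 1000).map(_.toDouble)
    // Rank 900 lies inside the allowed window of roughly [899, 901].
    println(withinRankBound(data, 900.0, 0.9, 0.001))
    // Rank 990 lies far outside that window.
    println(withinRankBound(data, 990.0, 0.9, 0.001))
  }
}
```

Applied to the attached data, the same check would take the collected "value" column as `data` and each approxQuantile() result as `x`.
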
> I attached a zip file containing the data used in the example above.
> The following shows that the data contains values at intermediate quantiles
> between the 0.5926195654486178 and 0.67621027121925 results observed
> above, so the jump between those results is not explained by a gap in the
> data.
> {code:java}
> scala> df.stat.approxQuantile("value", Array(0.9), 0.0)
> res10: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.91), 0.0)
> res11: Array[Double] = Array(0.5966354540849995)
> scala> df.stat.approxQuantile("value", Array(0.92), 0.0)
> res12: Array[Double] = Array(0.6015676591185595)
> scala> df.stat.approxQuantile("value", Array(0.93), 0.0)
> res13: Array[Double] = Array(0.6029240823799614)
> scala> df.stat.approxQuantile("value", Array(0.94), 0.0)
> res14: Array[Double] = Array(0.6117645471000034)
> scala> df.stat.approxQuantile("value", Array(0.95), 0.0)
> res15: Array[Double] = Array(0.6185162204274052)
> scala> df.stat.approxQuantile("value", Array(0.96), 0.0)
> res16: Array[Double] = Array(0.625983000807062)
> scala> df.stat.approxQuantile("value", Array(0.97), 0.0)
> res17: Array[Double] = Array(0.6306892943412258)
> scala> df.stat.approxQuantile("value", Array(0.98), 0.0)
> res18: Array[Double] = Array(0.6365567375994333)
> scala> df.stat.approxQuantile("value", Array(0.99), 0.0)
> res19: Array[Double] = Array(0.6554479197566019)
> {code}
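
The size of the discrepancy can be estimated from the numbers already quoted. The sketch below, in plain Scala, copies the relativeError = 0.0 results from the report and finds the largest listed probability whose quantile does not exceed a given value. It assumes those relativeError = 0.0 results are exact; the object and method names are hypothetical.

```scala
object ImpliedQuantile {
  // (probability, quantile) pairs copied from the relativeError = 0.0 runs in the report.
  val exact: Seq[(Double, Double)] = Seq(
    0.90 -> 0.5929591082174609,
    0.91 -> 0.5966354540849995,
    0.92 -> 0.6015676591185595,
    0.93 -> 0.6029240823799614,
    0.94 -> 0.6117645471000034,
    0.95 -> 0.6185162204274052,
    0.96 -> 0.625983000807062,
    0.97 -> 0.6306892943412258,
    0.98 -> 0.6365567375994333,
    0.99 -> 0.6554479197566019
  )

  /** Largest listed probability whose quantile is <= `value`,
    * or None if `value` is below every listed quantile. */
  def lowerBoundProbability(value: Double): Option[Double] =
    exact.filter(_._2 <= value).map(_._1).reduceOption(_ max _)

  def main(args: Array[String]): Unit = {
    // Result reported for relativeError = 0.001 and 0.004 at probability 0.9.
    println(lowerBoundProbability(0.67621027121925))   // Some(0.99)
    // Result reported for relativeError = 0.002 at probability 0.9.
    println(lowerBoundProbability(0.5926195654486178)) // None
  }
}
```

Under that assumption, 0.67621027121925 sits at or above the 0.99 quantile, so its rank is off from the requested 0.9 by roughly 0.09 or more, compared with a requested relativeError of 0.001 or 0.004.
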