[ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nandini malempati updated SPARK-30882:
--------------------------------------
Summary: Inaccurate results with higher precision ApproximatePercentile results  (was: Inaccurate results even with higher precision ApproximatePercentile results)
> Inaccurate results with higher precision ApproximatePercentile results
> -----------------------------------------------------------------------
>
> Key: SPARK-30882
> URL: https://issues.apache.org/jira/browse/SPARK-30882
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.4.4
> Reporter: Nandini malempati
> Priority: Major
>
> Results of ApproximatePercentile should become more accurate as the accuracy
> parameter increases, per the documentation here:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior. On a data set of 4500000 rows,
> accuracy 1000000 returns better results than accuracy 5000000 (P25 and P90
> look fine, but P75 is far off), while accuracy 7000000 gives exact results.
> This behavior is not consistent across runs.
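> For reference, the Scaladoc linked above says 1.0/accuracy is the relative
> error of the approximation. As a rough sketch (assuming that guarantee
> applies to the rank of the returned value), the worst-case rank error on
> this data set would be:
> {code:scala}
> // Sketch, assuming relative error = 1.0 / accuracy per the Scaladoc,
> // i.e. a worst-case rank error of ceil(n / accuracy) positions.
> val n = 4500000.0
> Seq(10000, 1000000, 5000000, 7000000).foreach { accuracy =>
>   println(s"accuracy=$accuracy -> max rank error ~ ${math.ceil(n / accuracy).toLong}")
> }
> // accuracy=10000   -> max rank error ~ 450
> // accuracy=1000000 -> max rank error ~ 5
> // accuracy=5000000 -> max rank error ~ 1
> // accuracy=7000000 -> max rank error ~ 1
> {code}
> With those bounds in mind, here is the test I used: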
> {code:scala}
> // Reproduction: percentile_approx over 4.5M rows at several accuracy settings.
> package com.microsoft.teams.test.utils
>
> import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
> import org.scalatest.{Assertions, FunSpec}
> import org.apache.spark.sql.functions.{lit, typedLit}
> import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> import scala.collection.mutable.ListBuffer
>
> class PercentilesTest extends FunSpec with Assertions {
>
>   it("check percentiles with different precision") {
>     val schema = List(
>       StructField("MetricName", StringType),
>       StructField("DataPoint", IntegerType))
>
>     // 4500000 rows: DataPoint = 1, 2, ..., 4500000, all under one metric,
>     // so the exact percentile values are known (e.g. exact P75 = 3375000).
>     val data = new ListBuffer[Row]
>     for (i <- 1 to 4500000) { data += Row("metric", i) }
>
>     import spark.implicits._
>     val df = createDF(schema, data.toSeq)
>     val percentages = typedLit(Seq(0.1, 0.25, 0.75, 0.9))
>
>     // Same aggregation at increasing accuracy values
>     // (DEFAULT_PERCENTILE_ACCURACY is 10000).
>     val accuracy1000 = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", percentages,
>         ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
>     val accuracy1M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", percentages, 1000000))
>     val accuracy5M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", percentages, 5000000))
>     val accuracy7M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", percentages, 7000000))
>     val accuracy10M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", percentages, 10000000))
>
>     accuracy1000.show(1, false)
>     accuracy1M.show(1, false)
>     accuracy5M.show(1, false)
>     accuracy7M.show(1, false)
>     accuracy10M.show(1, false)
>   }
>
>   // DataFrame wrapper around the Catalyst ApproximatePercentile expression;
>   // percentile_approx is only exposed through SQL in Spark 2.4.
>   def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
>     val expr = new ApproximatePercentile(
>       col.expr, percentage.expr, accuracy.expr
>     ).toAggregateExpression
>     new Column(expr)
>   }
>
>   def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
>     percentile_approx(col, percentage, lit(accuracy))
>
>   lazy val spark: SparkSession = {
>     SparkSession
>       .builder()
>       .master("local")
>       .appName("spark tests")
>       .getOrCreate()
>   }
>
>   def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = {
>     spark.createDataFrame(
>       spark.sparkContext.parallelize(data),
>       StructType(schema))
>   }
> }
> {code}
> The test above reproduces the error. Below are a few runs with different
> accuracies. P25 and P90 look fine, but with accuracy 5000000 the reported
> P75 is the max of the column.
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 7000000)|
> +----------+----------------------------------------------------------+
> |metric |[450000, 1125000, 3375000, 4050000] |
> +----------+----------------------------------------------------------+
>
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 5000000)|
> +----------+----------------------------------------------------------+
> |metric |[450000, 1125000, 4500000, 4050000] |
> +----------+----------------------------------------------------------+
>
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1000000)|
> +----------+----------------------------------------------------------+
> |metric |[450000, 1124998, 3374996, 4050000] |
> +----------+----------------------------------------------------------+
>
> +----------+--------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 10000)|
> +----------+--------------------------------------------------------+
> |metric |[450000, 1124848, 3374638, 4050000] |
> +----------+--------------------------------------------------------+
> This is breaking our reports, as there is no clear definition of what the
> accuracy parameter guarantees. We have data sets with more than 27000000
> rows. After studying the pattern, we found that the inaccurate percentiles
> always return the max of the column. P50 and P99 may be right in a few
> cases, but P75 can be wrong.
> Is there a way to determine the correct accuracy for a given dataset size?
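> For what it's worth, checking the observed P75 values against the rank
> window implied by that 1.0/accuracy guarantee (again, an assumption on my
> part) shows the accuracy=5000000 run is not merely imprecise but out of
> bound by over a million ranks:
> {code:scala}
> // Sketch: DataPoint values equal their ranks (1..n), so each returned
> // value can be compared directly against the allowed rank window
> // [p*n - rankError, p*n + rankError], with rankError = ceil(n / accuracy).
> val n = 4500000L
> val p = 0.75
> val observedP75 = Seq(10000 -> 3374638L, 1000000 -> 3374996L,
>   5000000 -> 4500000L, 7000000 -> 3375000L)
> observedP75.foreach { case (accuracy, value) =>
>   val rankError = math.ceil(n.toDouble / accuracy).toLong
>   val lo = (p * n).toLong - rankError
>   val hi = (p * n).toLong + rankError
>   println(s"accuracy=$accuracy: P75=$value, allowed [$lo, $hi], " +
>     s"within bound = ${value >= lo && value <= hi}")
> }
> // Only accuracy=5000000 prints "within bound = false": it returns the
> // column max, 1125000 ranks above the exact P75 of 3375000.
> {code}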