[
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nandini malempati updated SPARK-30882:
--------------------------------------
Summary: Inaccurate ApproximatePercentile results (was:
ApproximatePercentile results )
> Inaccurate ApproximatePercentile results
> -----------------------------------------
>
> Key: SPARK-30882
> URL: https://issues.apache.org/jira/browse/SPARK-30882
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.4.4
> Reporter: Nandini malempati
> Priority: Major
>
> Results of ApproximatePercentile should become more accurate as the accuracy
> parameter increases, per the documentation provided here:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior. On a data set of 4500000 rows,
> accuracy 1000000 returns better results than 5000000, and accuracy 7000000
> gives exact results. But this behavior is not consistent.
> {code:scala}
> package com.microsoft.teams.test.utils
>
> import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
> import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> import org.scalatest.{Assertions, FunSpec}
>
> import scala.collection.mutable.ListBuffer
>
> class PercentilesTest extends FunSpec with Assertions {
>
>   it("check percentiles with different precision") {
>     import spark.implicits._
>
>     val schema = List(StructField("MetricName", StringType), StructField("DataPoint", IntegerType))
>     val data = new ListBuffer[Row]
>     for (i <- 1 to 4500000) { data += Row("metric", i) }
>     val df = createDF(schema, data.toSeq)
>
>     val percentages = typedLit(Seq(0.1, 0.25, 0.75, 0.9))
>     val accuracy1000 = df.groupBy("MetricName")
>       .agg(percentile_approx($"DataPoint", percentages, ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
>     val accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 1000000))
>     val accuracy5M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 5000000))
>     val accuracy7M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 7000000))
>     val accuracy10M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 10000000))
>
>     accuracy1000.show(1, false)
>     accuracy1M.show(1, false)
>     accuracy5M.show(1, false)
>     accuracy7M.show(1, false)
>     accuracy10M.show(1, false)
>   }
>
>   // Wraps the Catalyst ApproximatePercentile expression so accuracy can be passed explicitly.
>   def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
>     val expr = new ApproximatePercentile(col.expr, percentage.expr, accuracy.expr).toAggregateExpression
>     new Column(expr)
>   }
>
>   def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
>     percentile_approx(col, percentage, lit(accuracy))
>
>   lazy val spark: SparkSession =
>     SparkSession.builder().master("local").appName("spark tests").getOrCreate()
>
>   def createDF(schema: List[StructField], data: Seq[Row]): DataFrame =
>     spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
> }
> {code}
> Above is a test that reproduces the error. In this run, with accuracy
> 5000000, P25 and P90 look fine, but P75 is the max of the column:
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 5000000)|
> +----------+----------------------------------------------------------+
> |metric    |[450000, 1125000, 4500000, 4050000]                       |
> +----------+----------------------------------------------------------+
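> If I'm reading the scaladoc's guarantee correctly (relative error err = 1.0 / accuracy,
> and the returned value x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)),
> then the P75 result above falls far outside the allowed window. A quick standalone sketch
> (rankWindow is my own helper, not a Spark API); since DataPoint holds the distinct values
> 1..N, rank(x) == x here:
> {code:scala}
> // Sketch: compute the rank window the documented bound would allow,
> // assuming relative error err = 1.0 / accuracy.
> def rankWindow(p: Double, accuracy: Int, n: Long): (Long, Long) = {
>   val err = 1.0 / accuracy
>   (math.floor((p - err) * n).toLong, math.ceil((p + err) * n).toLong)
> }
>
> // For p = 0.75, accuracy = 5000000, N = 4500000:
> // rankWindow(0.75, 5000000, 4500000L) == (3374999, 3375001),
> // but the query returned 4500000 (rank 4500000).
> {code}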
> This is breaking our reports, as there is no clear definition of what the
> accuracy parameter guarantees. We have data sets with more than 27000000 rows.
> Studying the pattern, we found that inaccurate percentiles always come back as
> the max of the column; P50 and P99 might be right in a few cases, but P75 can
> be wrong.
> Is there a way to determine the correct accuracy for a given dataset size?
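> For what it's worth, if the documented err = 1.0 / accuracy relationship holds,
> accuracy could be derived from the absolute rank error a report can tolerate.
> A sketch under that assumption (accuracyFor is a hypothetical helper, not a Spark API):
> {code:scala}
> // Choose accuracy so the worst-case rank error err * N stays within a
> // budget, assuming relative error err = 1.0 / accuracy as documented.
> def accuracyFor(n: Long, maxRankError: Long): Int =
>   math.ceil(n.toDouble / maxRankError).toInt
>
> // Example: tolerating a rank error of at most 1000 on 27000000 rows:
> // accuracyFor(27000000L, 1000) == 27000
> {code}
> Of course, that only helps if the implementation actually honors the documented
> bound, which the run above suggests it does not.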