Maxim Gekk created SPARK-32908:
----------------------------------

             Summary: percentile_approx() returns incorrect results
                 Key: SPARK-32908
                 URL: https://issues.apache.org/jira/browse/SPARK-32908
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.8, 3.0.2, 3.1.0
            Reporter: Maxim Gekk

Read input data from the attached CSV file:
{code:scala}
// Name of the temp view. The report does not show this definition,
// so "t" is an assumed placeholder.
val table = "t"

val df = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/maximgekk/tmp/tr_rat_resampling_score.csv")
  .repartition(1)
df.createOrReplaceTempView(table)
{code}
Calculate the 0.77 percentile with accuracy 100000 (relative error 1e-05):
{code:scala}
spark.sql(
  s"""SELECT
     |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
     |FROM $table
   """.stripMargin).show
{code}
{code}
+------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
+------------------------------------------------------------------------+
|                                                                    1000|
+------------------------------------------------------------------------+
{code}
The same query with a lower accuracy of 1000 (relative error 0.001):
{code}
+----------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
+----------------------------------------------------------------------+
|                                                                    18|
+----------------------------------------------------------------------+
{code}
and with a higher accuracy of 1000000 (relative error 1e-06):
{code}
+-------------------------------------------------------------------------+
|percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
+-------------------------------------------------------------------------+
|                                                                       17|
+-------------------------------------------------------------------------+
{code}
For accuracy 100000 (relative error 1e-05), the result should be around 17-18, not 1000.

Here is the percentile calculation in Google Sheets for the same input:
https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing
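
To check a suspicious percentile_approx() result outside Spark, one option is to compare it against an exact percentile and a rank window derived from the accuracy argument. The following is a minimal sketch in plain Scala, not Spark's implementation. It assumes a nearest-rank percentile definition and reads "1.0/accuracy is the relative error" as a rank error of at most n/accuracy positions; both are assumptions. The names PercentileCheck, exactPercentile and withinRankBound are hypothetical.

```scala
// Sketch only: nearest-rank percentile and an assumed rank-error window.
// None of these names exist in Spark; they are hypothetical helpers.
object PercentileCheck {

  /** Value at 1-based rank ceil(p * n) in the sorted input (nearest-rank definition). */
  def exactPercentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.sortWith(_ < _)
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  /**
   * True if `result` lies between the values found at ranks
   * target - slack and target + slack, where slack = ceil(n / accuracy).
   * Assumption: the relative error 1.0/accuracy applies to the rank.
   */
  def withinRankBound(values: Seq[Double], p: Double, accuracy: Int, result: Double): Boolean = {
    require(values.nonEmpty && accuracy > 0)
    val sorted = values.sortWith(_ < _)
    val n = sorted.length
    val slack = math.ceil(n.toDouble / accuracy).toInt
    val target = math.max(1, math.ceil(p * n).toInt)
    val lo = sorted(math.max(1, target - slack) - 1)
    val hi = sorted(math.min(n, target + slack) - 1)
    result >= lo && result <= hi
  }
}
```

With the column collected into a local Seq[Double], withinRankBound(values, 0.77, 100000, 1000.0) would indicate whether the reported value falls inside the window under these assumptions. Collecting to the driver is only practical for inputs that fit in memory.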