[ https://issues.apache.org/jira/browse/SPARK-27336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806947#comment-16806947 ]
Chakravarthi commented on SPARK-27336:
--------------------------------------

I'm checking this issue.

> Incorrect DataSet.summary() result
> ----------------------------------
>
>                 Key: SPARK-27336
>                 URL: https://issues.apache.org/jira/browse/SPARK-27336
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Gengliang Wang
>            Priority: Major
>         Attachments: test.csv
>
>
> There is a single data point in the minimum_nights column that is 1.0E8 out of 8k records, but .summary() says it is the 75% and the max.
> I compared this with approxQuantile, and approxQuantile for 75% gave the correct value of 30.0.
> To reproduce:
> {code:java}
> scala> val df = spark.read.format("csv").load("test.csv").withColumn("minimum_nights", '_c0.cast("Int"))
> df: org.apache.spark.sql.DataFrame = [_c0: string, minimum_nights: int]
> scala> df.select("minimum_nights").summary().show()
> +-------+------------------+
> |summary|    minimum_nights|
> +-------+------------------+
> |  count|              7072|
> |   mean| 14156.35407239819|
> | stddev|1189128.5444975856|
> |    min|                 1|
> |    25%|                 2|
> |    50%|                 4|
> |    75%|         100000000|
> |    max|         100000000|
> +-------+------------------+
> scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.1)
> res1: Array[Double] = Array(30.0)
> scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.001)
> res2: Array[Double] = Array(30.0)
> scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.0001)
> res3: Array[Double] = Array(1.0E8)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)