[ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447626#comment-16447626 ]
Takeshi Yamamuro commented on SPARK-24030:
------------------------------------------

I quickly tried this in both master and v2.3.0, but I couldn't reproduce it:

{code:java}
./bin/spark-shell --master=local[1] --conf spark.driver.memory=4g --conf spark.sql.shuffle.partitions=1 -v

scala> :paste
def timer[R](f: => R): Unit = {
  val count = 5
  val iters = (0 until count).map { i =>
    val t0 = System.nanoTime()
    f
    val t1 = System.nanoTime()
    val elapsed = t1 - t0 + 0.0
    println(s"#$i: ${elapsed / 1000000000.0}")
    elapsed
  }
  println("Avg. Elapsed Time: " + ((iters.sum / count) / 1000000000.0) + "s")
}

scala> timer { spark.range(1060000).selectExpr("percentile_approx(id, 0.5)").collect }
#0: 4.405557999
#1: 0.40483767
#2: 0.407931124
#3: 0.424493487
#4: 0.386281957
Avg. Elapsed Time: 1.2058204474s

scala> timer { spark.range(1040000).selectExpr("percentile_approx(id, 0.5)").collect }
#0: 4.560478621
#1: 0.387799115
#2: 0.38196225
#3: 0.377551809
#4: 0.390596532
Avg. Elapsed Time: 1.2196776654s
{code}

> SparkSQL percentile_approx function is too slow for over 1,060,000 records.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24030
>                 URL: https://issues.apache.org/jira/browse/SPARK-24030
>             Project: Spark
>          Issue Type: Bug
>      Components: SQL
>    Affects Versions: 2.2.1
>        Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop.
>            Reporter: Seok-Joon,Yun
>            Priority: Major
>        Attachments: screenshot_2018-04-20 23.15.02.png
>
> I used the percentile_approx function on over 1,060,000 records. It is too
> slow: it takes about 90 minutes. I then tried 1,040,000 records, which takes
> about 10 seconds.
> I tested with data read over JDBC and from Parquet files; both take the same
> amount of time.
> I wonder whether the function is designed for multiple workers.
> I checked Ganglia and the Spark history server; it ran on a single worker.
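For reference, the timing harness used in the comment can be run without Spark at all. The sketch below is a slightly generalized, Spark-free version of that `timer` helper (the `count` parameter and the `Double` return value are additions for illustration, not part of the original snippet); any by-name block can be benchmarked with it.

```scala
object TimerSketch {
  // Runs the by-name thunk `f` `count` times, printing per-run elapsed
  // seconds, and returns the average elapsed time in seconds.
  def timer[R](f: => R, count: Int = 5): Double = {
    val elapsed = (0 until count).map { i =>
      val t0 = System.nanoTime()
      f
      val t1 = System.nanoTime()
      val sec = (t1 - t0) / 1e9
      println(s"#$i: $sec")
      sec
    }
    val avg = elapsed.sum / count
    println(s"Avg. Elapsed Time: ${avg}s")
    avg
  }

  def main(args: Array[String]): Unit = {
    // Example: time a cheap pure computation three times.
    timer({ (1 to 1000000).sum }, count = 3)
  }
}
```

To time a Spark query instead, the thunk would simply be the query itself, e.g. `timer { spark.range(1060000).selectExpr("percentile_approx(id, 0.5)").collect }` as in the comment above.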
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org