[ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445103#comment-16445103 ]
Hyukjin Kwon commented on SPARK-24030:
--------------------------------------

It would be easier to debug and find the cause if there were a simple reproducer, plus logs or screen captures from the UI, if possible.

> SparkSQL percentile_approx function is too slow for over 1,060,000 records.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24030
>                 URL: https://issues.apache.org/jira/browse/SPARK-24030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>        Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop.
>            Reporter: Seok-Joon,Yun
>            Priority: Major
>
> I used the percentile_approx function on over 1,060,000 records. It is too
> slow: it takes about 90 minutes. I then tried 1,040,000 records, which took
> about 10 seconds.
> I tested reading the data over JDBC and from Parquet; both take the same
> amount of time.
> I wonder whether the function is designed for multiple workers. I looked at
> Ganglia and the Spark history server; it ran on one worker.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
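A minimal reproducer of the kind requested above might look like the following Spark SQL query; the table name `events` and column name `latency_ms` are placeholders, not from the original report:

```sql
-- Hypothetical reproducer sketch: approximate median over ~1,060,000 rows.
-- percentile_approx(col, percentage [, accuracy]) is the Spark SQL built-in;
-- `events` and `latency_ms` are assumed names for illustration only.
SELECT percentile_approx(latency_ms, 0.5) AS approx_median
FROM events;
```

Attaching the physical plan (`EXPLAIN` output) and stage timings from the Spark UI for such a query would help confirm whether the aggregation really runs on a single worker.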