[
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-25125.
----------------------------------
Resolution: Incomplete
> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -----------------------------------------------------------------------------
>
> Key: SPARK-25125
> URL: https://issues.apache.org/jira/browse/SPARK-25125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Mir Ali
> Priority: Major
> Labels: bulk-closed
>
> The percentile_approx function in Spark SQL takes much longer than the
> previous Hive implementation for large data sets (7B rows grouped into 200k
> buckets, with a percentile computed per bucket). Tested with Spark 2.3.1 vs
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1
> it does not finish at all in more than 2 hours. I also tried this with
> different accuracy values (5000, 1000, 500): the timing does get better with
> smaller datasets on the new version, but the speed difference is
> insignificant.
>
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR -> Spark 2.3.1
>
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g \
>   --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000 \
>   --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
>
> // 7B rows bucketed into ~200k random groups
> val df = spark.range(7000000000L)
>   .withColumn("some_grouping_id", round(rand() * 200000L).cast(LongType))
> df.createOrReplaceTempView("tab")
>
> val percentile_query = """
>   select some_grouping_id, percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1))
>   from tab group by some_grouping_id
> """
> spark.sql(percentile_query).collect()
> {code}
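> For reference, a sketch of how the smaller accuracy values mentioned above
> were passed; percentile_approx takes accuracy as an optional third argument
> (10000 is the documented default, and the query/variable name here is
> illustrative, not from the original run):
> {code:java}
> // Lower accuracy trades precision for speed; 500 was the smallest value tried.
> val percentile_query_acc = """
>   select some_grouping_id, percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1), 500)
>   from tab group by some_grouping_id
> """
> spark.sql(percentile_query_acc).collect()
> {code}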
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)