[
https://issues.apache.org/jira/browse/SPARK-25125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-25125.
----------------------------------
Resolution: Incomplete
> Spark SQL percentile_approx takes longer than Hive version for large datasets
> -----------------------------------------------------------------------------
>
> Key: SPARK-25125
> URL: https://issues.apache.org/jira/browse/SPARK-25125
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Mir Ali
> Priority: Major
> Labels: bulk-closed
>
> The percentile_approx function in Spark SQL takes much longer than the
> previous Hive implementation for large data sets (7B rows grouped into 200k
> buckets, with a percentile computed per bucket). Tested with Spark 2.3.1 vs
> Spark 2.1.0.
> The code below finishes in around 24 minutes on Spark 2.1.0; on Spark 2.3.1
> it does not finish at all in more than 2 hours. I also tried this with
> different accuracy values (5000, 1000, 500): the timing does get better with
> smaller datasets on the new version, but the speed difference is
> insignificant.
>
> Infrastructure used:
> AWS EMR -> Spark 2.1.0
> vs
> AWS EMR -> Spark 2.3.1
>
> spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g \
>   --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000 \
>   --num-executors=75 --executor-cores=2
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
>
> // 7B rows bucketed into ~200k random groups
> val df = spark.range(7000000000L)
>   .withColumn("some_grouping_id", round(rand() * 200000L).cast(LongType))
> df.createOrReplaceTempView("tab")
>
> val percentile_query = """
>   select some_grouping_id, percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1))
>   from tab group by some_grouping_id
> """
> spark.sql(percentile_query).collect()
> {code}
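> For reference, a sketch of how the smaller accuracy values mentioned above
> were passed; percentile_approx takes accuracy as an optional third argument
> (10000 is the documented default, and the query/variable name here is
> illustrative, not from the original run):
> {code:java}
> // Lower accuracy trades precision for speed; 500 was the smallest value tried.
> val percentile_query_acc = """
>   select some_grouping_id, percentile_approx(id, array(0, 0.25, 0.5, 0.75, 1), 500)
>   from tab group by some_grouping_id
> """
> spark.sql(percentile_query_acc).collect()
> {code}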
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)