[ 
https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106855#comment-16106855
 ] 

Hyukjin Kwon edited comment on SPARK-21577 at 7/31/17 6:26 AM:
---------------------------------------------------------------

Please check out https://spark.apache.org/community.html; I guess you can 
start a thread there.


was (Author: hyukjin.kwon):
Please check out https://spark.apache.org/community.html.

> Issue is handling too many aggregations 
> ----------------------------------------
>
>                 Key: SPARK-21577
>                 URL: https://issues.apache.org/jira/browse/SPARK-21577
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
>            Reporter: Kannan Subramanian
>
> My requirement is to read a table from Hive (size around 1.6 TB) and perform 
> more than 200 aggregation operations, mostly avg, sum, and std_dev. The Spark 
> application's total execution time is more than 12 hours. To optimize the 
> code I tried shuffle partitioning and memory tuning, but it has not helped. 
> Please note that I ran the same query in Hive on MapReduce, and the MR job 
> completed in only around 5 hours. Kindly let me know if there is any way to 
> optimize this, or a more efficient way of handling multiple aggregation 
> operations.
>
>     val inputDataDF = hiveContext.read.parquet("/inputparquetData")
>     inputDataDF.groupBy("seq_no", "year", "month", "radius")
>       .agg(count($"Dseq"), avg($"Emp"), avg($"Ntw"), avg($"Age"),
>            avg($"DAll"), avg($"PAll"), avg($"DSum"), avg($"dol"),
>            sum($"sl"), sum($"PA"), sum($"DS"), ... /* like 200 columns */)
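One way to keep this many aggregates manageable, rather than spelling out ~200
calls by hand, is to build the list of aggregate expressions programmatically
and pass it to agg(...). A minimal sketch against the Spark 1.6 API, assuming
the reporter's hiveContext is in scope and using hypothetical column lists
(avgCols, sumCols), since the full schema is not given in the report:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{avg, col, count, sum}

    // Hypothetical column lists; the report only names a handful of the ~200 columns.
    val avgCols = Seq("Emp", "Ntw", "Age", "DAll", "PAll", "DSum", "dol")
    val sumCols = Seq("sl", "PA", "DS")

    // Build all aggregate expressions once.
    val aggExprs: Seq[Column] =
      Seq(count(col("Dseq"))) ++
      avgCols.map(c => avg(col(c))) ++
      sumCols.map(c => sum(col(c)))

    val inputDataDF = hiveContext.read.parquet("/inputparquetData")

    // GroupedData.agg takes a first Column plus varargs, so split head and tail.
    val result = inputDataDF
      .groupBy("seq_no", "year", "month", "radius")
      .agg(aggExprs.head, aggExprs.tail: _*)

Note that this only tidies how the query is constructed; all aggregates for a
group are still computed together, so any runtime improvement would have to
come from partitioning, memory tuning, or reducing the data read, not from how
the expressions are written.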


