Groupby is faster in Impala than Spark SQL - any suggestions

KhajaAsmath Mohammed Tue, 28 Mar 2017 07:36:12 -0700

Hi,

I am working on a requirement where I need to join two tables and do a
group by to get the max value of some fields.


Table1: 10 GB of data
Table2: 96 GB of data

The same query takes around 20 minutes in Impala, but it took almost 3
hours to run in Spark SQL.

I have added a repartition to the DataFrame and persisted it as memory and
disk, but the response time is still very bad. Any suggestions?

val results_group_dataframe = sqlContext.sql(
  """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
    |FROM GeoSpatialTemp a
    |GROUP BY a.vin, a.originalsamplingstate""".stripMargin
).repartition(numPartitions)
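
For reference, the repartition and persist steps mentioned above follow roughly the pattern in this sketch. The helper name repartitionAndCache is hypothetical and is not taken from the actual job; it only illustrates the two calls, assuming a Spark version where DataFrame is importable from org.apache.spark.sql.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Hypothetical helper (illustration only, not from the original job):
// redistributes the rows into numPartitions partitions, then marks the
// result to be cached in memory, spilling to disk when it does not fit.
def repartitionAndCache(df: DataFrame, numPartitions: Int): DataFrame =
  df.repartition(numPartitions).persist(StorageLevel.MEMORY_AND_DISK)
```

Note that persist is lazy: nothing is cached until an action (for example count) runs on the returned DataFrame.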

Thanks,

Asmath
