Hi,

I am working on a requirement where I need to join two tables and do a
group by to get the max value of some fields.

Table1: 10 GB of data
Table2: 96 GB of data

The same query takes around 20 minutes in Impala, but it took almost 3
hours to run in Spark SQL.

I have added a repartition to the DataFrame and persisted it as memory
and disk, but the response is still very bad. Any suggestions?

val results_group_dataframe = sqlContext.sql(
  """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
    |FROM GeoSpatialTemp a
    |GROUP BY a.vin, a.originalsamplingstate""".stripMargin
).repartition(numPartitions)

Thanks,

Asmath
