Hi, I am working on a requirement where I need to join two tables and do a group by to get the max value of some fields.
Table1: 10 GB of data
Table2: 96 GB of data

The same query takes around 20 minutes in Impala, but it took almost 3 hours to run in Spark SQL. I have added a repartition to the DataFrame and persisted it to memory and disk, but the response time is still very bad. Any suggestions?

val results_group_dataframe = sqlContext.sql("SELECT a.originalsamplingstate,a.vin,max(a.utctime) as geospatialtime FROM GeoSpatialTemp A GROUP BY a.VIN, a.OriginalSamplingState").repartition(numPartitions)

Thanks,
Asmath