How long does it take if you remove the repartition and just collect the result? I don't think the repartition is needed here, since the GROUP BY already triggers a shuffle.
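
To make that suggestion concrete, here is a rough sketch of the same query without the trailing repartition, timed around a plain collect. It assumes Spark 2.x with a SparkSession (on 1.6 you would use sqlContext and registerTempTable instead). The local master, the toy rows, the shuffle partition count, and the names GroupByWithoutRepartition, registerToyData, and groupedMax are all made up for illustration and are not from your job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupByWithoutRepartition {

  // Hypothetical stand-in for the real GeoSpatialTemp table: three toy rows.
  def registerToyData(spark: SparkSession): Unit = {
    import spark.implicits._
    Seq(
      ("VIN1", "CA", 100L),
      ("VIN1", "CA", 250L),
      ("VIN2", "TX", 175L)
    ).toDF("vin", "originalsamplingstate", "utctime")
      .createOrReplaceTempView("GeoSpatialTemp")
  }

  // Same aggregation as in the original mail, with no repartition afterwards.
  def groupedMax(spark: SparkSession): DataFrame =
    spark.sql(
      """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
        |FROM GeoSpatialTemp a
        |GROUP BY a.vin, a.originalsamplingstate""".stripMargin)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("group-by-without-repartition")
      .master("local[*]") // assumption: local run, for illustration only
      // Partition count used by the GROUP BY shuffle; 8 is an arbitrary value.
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()

    registerToyData(spark)
    val grouped = groupedMax(spark)

    // Print the physical plan so the shuffle (Exchange) can be inspected.
    grouped.explain()

    // Time the action itself; nothing runs until collect is called.
    val start = System.nanoTime()
    val rows = grouped.collect()
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"collected ${rows.length} rows in $elapsedMs%.1f ms")

    spark.stop()
  }
}
```

If the real result is too large to bring back to the driver, an action such as count() would trigger the same job. The number of partitions after the GROUP BY is governed by spark.sql.shuffle.partitions, so that setting is probably a better place to tune than an extra repartition.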

On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
> Hi,
>
> I am working on a requirement where I need to join two tables and do a
> group by to get the max value on some fields.
>
> Table1: 10 GB of data
> Table2: 96 GB of data
>
> The same query in Impala takes around 20 minutes, and it took almost 3
> hours to run in Spark SQL.
>
> I have added repartition to the dataframe and persisted it as memory and
> disk, but the response is still very bad. Any suggestions?
>
> val results_group_dataframe = sqlContext.sql("SELECT a.originalsamplingstate, a.vin, max(a.utctime) as geospatialtime FROM GeoSpatialTemp A GROUP BY a.VIN, a.OriginalSamplingState").repartition(numPartitions)
>
> Thanks,
>
> Asmath