How long does it take if you remove the repartition and just collect the
result? I don't think the repartition is needed here. There's already a
shuffle for the group by.
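
As a rough sketch of what I mean (not your actual job): the snippet below runs the same aggregation without the trailing repartition and prints the physical plan, which should already show an Exchange step for the GROUP BY. It assumes Spark 2.x or later with a local SparkSession. The object name, the helper latestPerVin, and the three sample rows are made up for illustration; only the table and column names come from your query.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupByWithoutRepartition {

  // Hypothetical helper: same aggregation as in the question,
  // minus the trailing .repartition(numPartitions).
  def latestPerVin(spark: SparkSession): DataFrame =
    spark.sql(
      """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
        |FROM GeoSpatialTemp a
        |GROUP BY a.vin, a.originalsamplingstate""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Assumption: local master, purely so the sketch runs standalone.
    val spark = SparkSession.builder()
      .appName("groupby-without-repartition")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in rows for the GeoSpatialTemp table.
    Seq(
      ("CA", "VIN1", 100L),
      ("CA", "VIN1", 250L),
      ("TX", "VIN2", 175L)
    ).toDF("originalsamplingstate", "vin", "utctime")
      .createOrReplaceTempView("GeoSpatialTemp")

    val results = latestPerVin(spark)

    // Look for an Exchange (shuffle) node introduced by the GROUP BY.
    results.explain()

    // The post-shuffle partition count is driven by spark.sql.shuffle.partitions
    // (adaptive execution, where enabled, may coalesce it further).
    println(results.rdd.getNumPartitions)

    results.show()
    spark.stop()
  }
}
```

If the number of output partitions is the concern, tuning spark.sql.shuffle.partitions changes the shuffle the GROUP BY already performs, whereas a repartition after the aggregation adds a second shuffle on top of it.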

On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am working on a requirement where I need to join two tables and do a group
> by to get the max value on some fields.
>
> Table1: 10 GB of data
> Table2: 96 GB of data
>
> The same query in Impala takes around 20 minutes, and it took almost 3
> hours to run in Spark SQL.
>
> I have added a repartition to the dataframe and persisted it as memory and
> disk, but the response is still very bad. Any suggestions?
>
> val results_group_dataframe=sqlContext.sql("SELECT 
> a.originalsamplingstate,a.vin,max(a.utctime) as geospatialtime FROM 
> GeoSpatialTemp A GROUP BY a.VIN, 
> a.OriginalSamplingState").repartition(numPartitions)
>
> Thanks,
>
> Asmath
>
>
