Thanks, but the whole point is not setting it explicitly; it should be derived from its parent RDDs.
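For what it's worth, a rough sketch of deriving the value from the parents at runtime instead of hard-coding it (toy data and the 1.5-era SQLContext API assumed; Spark does not appear to do this on its own here):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setMaster('local[4]').setAppName('join-partitions-sketch'))
sqlContext = SQLContext(sc)

# Toy dataframes standing in for the two 20-partition inputs.
left = sqlContext.createDataFrame(
    sc.parallelize([(i, 'a') for i in range(100)], 20), ['k', 'v1'])
right = sqlContext.createDataFrame(
    sc.parallelize([(i, 'b') for i in range(100)], 20), ['k', 'v2'])

# Not derived automatically: read the parents' partition counts and feed
# the value into the shuffle setting before the join is planned.
n = max(left.rdd.getNumPartitions(), right.rdd.getNumPartitions())
sqlContext.setConf('spark.sql.shuffle.partitions', str(n))

joined = left.join(right, on='k')
print(joined.rdd.getNumPartitions())  # 20, taken from the parents above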
Thanks

On Fri, Jun 24, 2016 at 6:09 AM, ayan guha <[email protected]> wrote:

> You can change parallelism like the following:
>
> conf = SparkConf()
> conf.set('spark.sql.shuffle.partitions', 10)
> sc = SparkContext(conf=conf)
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <[email protected]> wrote:
>
>> Hi,
>>
>> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
>> each, and the joined dataframe has 100 partitions. I want to know what is
>> the way to keep it at 20 (other than repartition and coalesce).
>>
>> Also, when I join these 2 dataframes I am using 4 columns as join
>> columns. The dataframes are partitioned on the first 2 columns of the
>> join, so in effect each partition should only need to be joined with the
>> corresponding partition and not with the rest. So why is Spark shuffling
>> all the data?
>>
>> Similarly, when my dataframe is partitioned by col1,col2 and I use
>> group by on col1,col2,col3,col4, why does it shuffle everything, whereas
>> it only needs to sort each partition and do the grouping there?
>>
>> Bit confusing. I am using 1.5.1.
>>
>> Is this fixed in future versions?
>>
>> Thanks
>
> --
> Best Regards,
> Ayan Guha
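As an aside, the same setting governs the group-by case in the quoted question; a quick check along the lines of the suggestion above (again only a sketch, 1.5-era API and toy data assumed):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Follow the quoted suggestion: fix the shuffle partition count up front.
conf = SparkConf().setMaster('local[4]').setAppName('groupby-partitions-sketch')
conf.set('spark.sql.shuffle.partitions', '20')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    sc.parallelize([(i % 5, i % 3, i) for i in range(1000)], 20),
    ['col1', 'col2', 'col3'])

# Any aggregation that needs a shuffle ends up with the configured count,
# regardless of how the input was partitioned.
counts = df.groupBy('col1', 'col2').count()
print(counts.rdd.getNumPartitions())  # 20 instead of the default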
