Thanks, but the whole point is not setting it explicitly; it should be derived from its parent RDDs.
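For what it's worth, a rough sketch of deriving the value from the parents at runtime instead of hard-coding it (toy data and the 1.5-era SQLContext API assumed; Spark does not appear to do this on its own here):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setMaster('local[4]').setAppName('join-partitions-sketch'))
sqlContext = SQLContext(sc)

# Toy dataframes standing in for the two 20-partition inputs.
left = sqlContext.createDataFrame(
    sc.parallelize([(i, 'a') for i in range(100)], 20), ['k', 'v1'])
right = sqlContext.createDataFrame(
    sc.parallelize([(i, 'b') for i in range(100)], 20), ['k', 'v2'])

# Not derived automatically: read the parents' partition counts and feed
# the value into the shuffle setting before the join is planned.
n = max(left.rdd.getNumPartitions(), right.rdd.getNumPartitions())
sqlContext.setConf('spark.sql.shuffle.partitions', str(n))

joined = left.join(right, on='k')
print(joined.rdd.getNumPartitions())  # 20, taken from the parents above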
Thanks

On Fri, Jun 24, 2016 at 6:09 AM, ayan guha <[email protected]> wrote:

> You can change parallelism like the following:
>
> conf = SparkConf()
> conf.set('spark.sql.shuffle.partitions', 10)
> sc = SparkContext(conf=conf)
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <[email protected]> wrote:
>
>> Hi,
>>
>> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
>> each, and the joined dataframe has 100 partitions. I want to know what is
>> the way to keep it at 20 (other than repartition and coalesce).
>>
>> Also, when I join these 2 dataframes I am using 4 columns as join
>> columns. The dataframes are partitioned on the first 2 columns of the
>> join, so in effect each partition should only need to be joined with the
>> corresponding partition and not with the rest. So why is Spark shuffling
>> all the data?
>>
>> Similarly, when my dataframe is partitioned by col1,col2 and I use
>> group by on col1,col2,col3,col4, why does it shuffle everything, whereas
>> it only needs to sort each partition and do the grouping there?
>>
>> Bit confusing. I am using 1.5.1.
>>
>> Is this fixed in future versions?
>>
>> Thanks
>
> --
> Best Regards,
> Ayan Guha
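As an aside, the same setting governs the group-by case in the quoted question; a quick check along the lines of the suggestion above (again only a sketch, 1.5-era API and toy data assumed):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Follow the quoted suggestion: fix the shuffle partition count up front.
conf = SparkConf().setMaster('local[4]').setAppName('groupby-partitions-sketch')
conf.set('spark.sql.shuffle.partitions', '20')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    sc.parallelize([(i % 5, i % 3, i) for i in range(1000)], 20),
    ['col1', 'col2', 'col3'])

# Any aggregation that needs a shuffle ends up with the configured count,
# regardless of how the input was partitioned.
counts = df.groupBy('col1', 'col2').count()
print(counts.rdd.getNumPartitions())  # 20 instead of the default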
