Do you mean that RDD#partitions can be split into multiple Spark task partitions? If so, is the optimization below wrong?
Without opt.: ####
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
      +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None

With opt.: ####
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None

On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com> wrote:

> It is not necessary if you are using bucketing available in Spark 2.0. For
> partitioning, it is still necessary because we do not assume each partition
> is small, and as a result there is no guarantee all the records for a
> partition end up in a single Spark task partition.
>
>
> On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:
>
>> Hi,
>>
>> If I had a df and I wrote it out via partitionBy("id"), presumably, when
>> I load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
>> right? Effectively, we can load in the dataframe with a hash partitioner
>> already set, since each task can simply read all the folders where
>> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
>> optimization that is on the radar? This will be a huge boon in terms of
>> reducing the number of shuffles necessary if we're always joining on the
>> same columns.
>>
>> Best,
>>
>> Justin
>>

-- 
---
Takeshi Yamamuro
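For anyone skimming the archive: the idea in Justin's question can be sketched outside of Spark. This is a minimal, hypothetical illustration (plain Python, not Spark code; the names `folder_to_reducer` and `assign_folders` are invented for this sketch) of mapping partitionBy-style output folders `id=<value>` to reducers via `hash(<value>) % reducer_count`, so that all records for one id land in one task.

```python
# Hypothetical sketch of the "hash partitioner already set" idea:
# assign each partition folder id=<value> to a reducer by hashing <value>.
# Not Spark's API; names here are invented for illustration.
import zlib


def folder_to_reducer(value: str, reducer_count: int) -> int:
    """Map a folder's id value to a reducer deterministically.

    Uses CRC32 as a stable hash; Python's built-in hash() is salted
    per-process, so it could not be shared across executors.
    """
    return zlib.crc32(value.encode("utf-8")) % reducer_count


def assign_folders(folders, reducer_count):
    """Group folders like 'id=42' by the reducer that would own them."""
    assignment = {r: [] for r in range(reducer_count)}
    for folder in folders:
        value = folder.split("=", 1)[1]
        assignment[folder_to_reducer(value, reducer_count)].append(folder)
    return assignment


folders = [f"id={i}" for i in range(10)]
plan = assign_folders(folders, reducer_count=4)

# Each folder is owned by exactly one reducer, so a later groupBy("id")
# would need no exchange: every id's records are read by a single task.
assert sorted(f for fs in plan.values() for f in fs) == sorted(folders)
```

As Reynold notes above, Spark 2.0's bucketing (e.g. `df.write.bucketBy(n, "id").saveAsTable(...)`) records this kind of hash layout with the table metadata, which is what makes eliding the shuffle safe; plain `partitionBy` does not, because one partition's files may still be split across several task partitions.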