The original email was asking about Hive-style data partitioning for files on disk, not about in-memory caching.
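To make the distinction concrete, here is a minimal sketch of the file-based case the original email describes (this assumes a spark-shell with an existing DataFrame df that has an "id" column; the path /tmp/events is just a placeholder, not from the thread):

    // Hive-style partitioning on disk: one directory per distinct id value,
    // e.g. /tmp/events/id=42/part-*.parquet
    df.write.partitionBy("id").parquet("/tmp/events")

    // Reading it back and grouping on the partition column still plans an
    // Exchange today, because Spark does not assume all rows for a given id
    // end up in a single task partition.
    val reloaded = sqlContext.read.parquet("/tmp/events")
    reloaded.groupBy("id").count().explain()

This is separate from caching the table in memory (the InMemoryColumnarTableScan case in the plans quoted below).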
On Thursday, January 21, 2016, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> You mean RDD#partitions are possibly split into multiple Spark task
> partitions?
> If so, the optimization below is wrong?
>
> Without opt.:
> ####
> == Physical Plan ==
> TungstenAggregate(key=[col0#159],
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
> output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159],
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
> output=[col0#159,sum#200,sum#201,count#202L])
> +- TungstenExchange hashpartitioning(col0#159,200), None
> +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>
> With opt.:
> ####
> == Physical Plan ==
> TungstenAggregate(key=[col0#159],
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
> output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
> +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>
>
> On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> It is not necessary if you are using bucketing available in Spark 2.0.
>> For partitioning, it is still necessary because we do not assume each
>> partition is small, and as a result there is no guarantee all the records
>> for a partition end up in a single Spark task partition.
>>
>>
>> On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I had a df and I wrote it out via partitionBy("id"), presumably, when
>>> I load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
>>> right? Effectively, we can load in the dataframe with a hash partitioner
>>> already set, since each task can simply read all the folders where
>>> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
>>> optimization that is on the radar? This will be a huge boon in terms of
>>> reducing the number of shuffles necessary if we're always joining on the
>>> same columns.
>>>
>>> Best,
>>>
>>> Justin
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
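For readers of the archive, a rough sketch of the Spark 2.0 bucketing Reynold refers to; the table name "events_bucketed" and the bucket count of 200 are assumptions for illustration, not from the thread:

    // Bucketing hash-distributes rows by id into a fixed number of buckets
    // at write time; in Spark 2.0 it requires saveAsTable rather than a
    // path-based save.
    df.write
      .bucketBy(200, "id")
      .sortBy("id")                // optional: keep each bucket sorted on disk
      .saveAsTable("events_bucketed")

    // A later aggregation or join on the bucket column can reuse the on-disk
    // hash distribution instead of planning an Exchange.
    spark.table("events_bucketed").groupBy("id").count().explain()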