My bad, thanks.

On Fri, Jan 22, 2016 at 4:34 PM, Reynold Xin <r...@databricks.com> wrote:
> The original email was asking about data partitioning (Hive-style) for
> files, not in-memory caching.
>
> On Thursday, January 21, 2016, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> You mean RDD#partitions are possibly split across multiple Spark task
>> partitions? If so, is the optimization below wrong?
>>
>> Without opt.:
>> ####
>> == Physical Plan ==
>> TungstenAggregate(key=[col0#159],
>> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>> output=[col0#159,sum(col1)#177,avg(col2)#178])
>> +- TungstenAggregate(key=[col0#159],
>> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>> output=[col0#159,sum#200,sum#201,count#202L])
>>    +- TungstenExchange hashpartitioning(col0#159,200), None
>>       +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
>> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
>> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>>
>> With opt.:
>> ####
>> == Physical Plan ==
>> TungstenAggregate(key=[col0#159],
>> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>> output=[col0#159,sum(col1)#177,avg(col2)#178])
>> +- TungstenExchange hashpartitioning(col0#159,200), None
>>    +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
>> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
>> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>>
>> On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>>> It is not necessary if you are using bucketing, available in Spark 2.0.
>>> For partitioning, it is still necessary, because we do not assume each
>>> partition is small, and as a result there is no guarantee that all the
>>> records for a partition end up in a single Spark task partition.
>>>
>>> On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> If I have a df and I write it out via partitionBy("id"), then,
>>>> presumably, when I load the df back in and do a groupBy("id"), a
>>>> shuffle shouldn't be necessary, right? Effectively, we can load the
>>>> dataframe with a hash partitioner already set, since each task can
>>>> simply read all the folders id=<value> for which
>>>> hash(<value>) % reducer_count == reducer_id. Is this an optimization
>>>> that is on the radar? This would be a huge boon for reducing the
>>>> number of shuffles necessary if we're always joining on the same
>>>> columns.
>>>>
>>>> Best,
>>>>
>>>> Justin
>>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>

--
---
Takeshi Yamamuro
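
[Editor's addendum] A minimal sketch (Scala, Spark 2.x) of the contrast
discussed above: partitionBy("id") lays the data out as one directory per
value, so a later groupBy("id") still needs an exchange, while bucketBy
hash-distributes rows at write time, so the same aggregation should be
able to reuse that distribution and skip the shuffle. All paths, table
names, column names, and the bucket count are illustrative, not from the
thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

object PartitionVsBucket {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-vs-bucket")
      .master("local[*]")
      .getOrCreate()

    // Toy data: 100 distinct keys, two value columns.
    val df = spark.range(0L, 100000L)
      .selectExpr("id % 100 AS id", "id AS col1", "id * 2 AS col2")

    // Hive-style partitioning: one directory per id value. As noted in the
    // thread, Spark does not assume each directory is small, so there is no
    // guarantee a partition's records land in one task partition, and the
    // aggregation below still plans an Exchange.
    df.write.mode("overwrite").partitionBy("id").parquet("/tmp/by_partition")
    spark.read.parquet("/tmp/by_partition")
      .groupBy("id").agg(sum("col1"), avg("col2"))
      .explain()

    // Bucketing (Spark 2.0+): rows are hashed into a fixed number of buckets
    // at write time; bucketed output requires saveAsTable rather than a bare
    // path. The scan then reports a hash partitioning on "id" that the
    // aggregation can reuse, so its plan should contain no Exchange.
    df.write.mode("overwrite").bucketBy(8, "id").sortBy("id")
      .saveAsTable("by_bucket")
    spark.table("by_bucket")
      .groupBy("id").agg(sum("col1"), avg("col2"))
      .explain()

    spark.stop()
  }
}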