Do you mean that RDD#partitions can be split into multiple Spark task partitions? If so, is the optimization below wrong?
Without opt.: ####
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
      +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None

With opt.: ####
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None

On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com> wrote:

> It is not necessary if you are using bucketing available in Spark 2.0. For
> partitioning, it is still necessary because we do not assume each partition
> is small, and as a result there is no guarantee all the records for a
> partition end up in a single Spark task partition.
>
>
> On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:
>
>> Hi,
>>
>> If I had a df and I wrote it out via partitionBy("id"), presumably, when
>> I load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
>> right? Effectively, we can load in the dataframe with a hash partitioner
>> already set, since each task can simply read all the folders where
>> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
>> optimization that is on the radar? This will be a huge boon in terms of
>> reducing the number of shuffles necessary if we're always joining on the
>> same columns.
>>
>> Best,
>>
>> Justin
>>

-- 
---
Takeshi Yamamuro
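For anyone skimming the archive: the idea in Justin's question can be sketched outside of Spark. This is a minimal, hypothetical illustration (plain Python, not Spark code; the names `folder_to_reducer` and `assign_folders` are invented for this sketch) of mapping partitionBy-style output folders `id=<value>` to reducers via `hash(<value>) % reducer_count`, so that all records for one id land in one task.

```python
# Hypothetical sketch of the "hash partitioner already set" idea:
# assign each partition folder id=<value> to a reducer by hashing <value>.
# Not Spark's API; names here are invented for illustration.
import zlib


def folder_to_reducer(value: str, reducer_count: int) -> int:
    """Map a folder's id value to a reducer deterministically.

    Uses CRC32 as a stable hash; Python's built-in hash() is salted
    per-process, so it could not be shared across executors.
    """
    return zlib.crc32(value.encode("utf-8")) % reducer_count


def assign_folders(folders, reducer_count):
    """Group folders like 'id=42' by the reducer that would own them."""
    assignment = {r: [] for r in range(reducer_count)}
    for folder in folders:
        value = folder.split("=", 1)[1]
        assignment[folder_to_reducer(value, reducer_count)].append(folder)
    return assignment


folders = [f"id={i}" for i in range(10)]
plan = assign_folders(folders, reducer_count=4)

# Each folder is owned by exactly one reducer, so a later groupBy("id")
# would need no exchange: every id's records are read by a single task.
assert sorted(f for fs in plan.values() for f in fs) == sorted(folders)
```

As Reynold notes above, Spark 2.0's bucketing (e.g. `df.write.bucketBy(n, "id").saveAsTable(...)`) records this kind of hash layout with the table metadata, which is what makes eliding the shuffle safe; plain `partitionBy` does not, because one partition's files may still be split across several task partitions.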