The original email was asking about Hive-style data partitioning for files on disk, not about in-memory caching.
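To make the distinction concrete, here is a minimal sketch of the file-based case the original email describes (this assumes a spark-shell with an existing DataFrame df that has an "id" column; the path /tmp/events is just a placeholder, not from the thread):

    // Hive-style partitioning on disk: one directory per distinct id value,
    // e.g. /tmp/events/id=42/part-*.parquet
    df.write.partitionBy("id").parquet("/tmp/events")

    // Reading it back and grouping on the partition column still plans an
    // Exchange today, because Spark does not assume all rows for a given id
    // end up in a single task partition.
    val reloaded = sqlContext.read.parquet("/tmp/events")
    reloaded.groupBy("id").count().explain()

This is separate from caching the table in memory (the InMemoryColumnarTableScan case in the plans quoted below).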
On Thursday, January 21, 2016, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> You mean RDD#partitions are possibly split into multiple Spark task
> partitions?
> If so, the optimization below is wrong?
>
> Without opt.:
> ####
> == Physical Plan ==
> TungstenAggregate(key=[col0#159],
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
> output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159],
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
> output=[col0#159,sum#200,sum#201,count#202L])
> +- TungstenExchange hashpartitioning(col0#159,200), None
> +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>
> With opt.:
> ####
> == Physical Plan ==
> TungstenAggregate(key=[col0#159],
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
> output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
> +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>
>
> On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> It is not necessary if you are using bucketing available in Spark 2.0.
>> For partitioning, it is still necessary because we do not assume each
>> partition is small, and as a result there is no guarantee all the records
>> for a partition end up in a single Spark task partition.
>>
>>
>> On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I had a df and I wrote it out via partitionBy("id"), presumably, when
>>> I load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
>>> right? Effectively, we can load in the dataframe with a hash partitioner
>>> already set, since each task can simply read all the folders where
>>> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
>>> optimization that is on the radar? This will be a huge boon in terms of
>>> reducing the number of shuffles necessary if we're always joining on the
>>> same columns.
>>>
>>> Best,
>>>
>>> Justin
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
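For readers of the archive, a rough sketch of the Spark 2.0 bucketing Reynold refers to; the table name "events_bucketed" and the bucket count of 200 are assumptions for illustration, not from the thread:

    // Bucketing hash-distributes rows by id into a fixed number of buckets
    // at write time; in Spark 2.0 it requires saveAsTable rather than a
    // path-based save.
    df.write
      .bucketBy(200, "id")
      .sortBy("id")                // optional: keep each bucket sorted on disk
      .saveAsTable("events_bucketed")

    // A later aggregation or join on the bucket column can reuse the on-disk
    // hash distribution instead of planning an Exchange.
    spark.table("events_bucketed").groupBy("id").count().explain()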