My bad, thanks.

On Fri, Jan 22, 2016 at 4:34 PM, Reynold Xin <r...@databricks.com> wrote:
> The original email was asking about data partitioning (Hive-style) for
> files, not in-memory caching.
>
> On Thursday, January 21, 2016, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> You mean RDD#partitions are possibly split across multiple Spark task
>> partitions? If so, is the optimization below wrong?
>>
>> Without opt.:
>> ####
>> == Physical Plan ==
>> TungstenAggregate(key=[col0#159],
>> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>> output=[col0#159,sum(col1)#177,avg(col2)#178])
>> +- TungstenAggregate(key=[col0#159],
>> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>> output=[col0#159,sum#200,sum#201,count#202L])
>>    +- TungstenExchange hashpartitioning(col0#159,200), None
>>       +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
>> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
>> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>>
>> With opt.:
>> ####
>> == Physical Plan ==
>> TungstenAggregate(key=[col0#159],
>> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>> output=[col0#159,sum(col1)#177,avg(col2)#178])
>> +- TungstenExchange hashpartitioning(col0#159,200), None
>>    +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161],
>> InMemoryRelation [col0#159,col1#160,col2#161], true, 10000,
>> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
>>
>> On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>>
>>> It is not necessary if you are using bucketing, available in Spark 2.0.
>>> For partitioning, it is still necessary, because we do not assume each
>>> partition is small, and as a result there is no guarantee that all the
>>> records for a partition end up in a single Spark task partition.
>>>
>>> On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> If I have a df and I write it out via partitionBy("id"), then,
>>>> presumably, when I load the df back in and do a groupBy("id"), a
>>>> shuffle shouldn't be necessary, right? Effectively, we can load the
>>>> dataframe with a hash partitioner already set, since each task can
>>>> simply read all the folders id=<value> for which
>>>> hash(<value>) % reducer_count == reducer_id. Is this an optimization
>>>> that is on the radar? This would be a huge boon for reducing the
>>>> number of shuffles necessary if we're always joining on the same
>>>> columns.
>>>>
>>>> Best,
>>>>
>>>> Justin
>>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>

--
---
Takeshi Yamamuro
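
[Editor's addendum] A minimal sketch (Scala, Spark 2.x) of the contrast
discussed above: partitionBy("id") lays the data out as one directory per
value, so a later groupBy("id") still needs an exchange, while bucketBy
hash-distributes rows at write time, so the same aggregation should be
able to reuse that distribution and skip the shuffle. All paths, table
names, column names, and the bucket count are illustrative, not from the
thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

object PartitionVsBucket {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-vs-bucket")
      .master("local[*]")
      .getOrCreate()

    // Toy data: 100 distinct keys, two value columns.
    val df = spark.range(0L, 100000L)
      .selectExpr("id % 100 AS id", "id AS col1", "id * 2 AS col2")

    // Hive-style partitioning: one directory per id value. As noted in the
    // thread, Spark does not assume each directory is small, so there is no
    // guarantee a partition's records land in one task partition, and the
    // aggregation below still plans an Exchange.
    df.write.mode("overwrite").partitionBy("id").parquet("/tmp/by_partition")
    spark.read.parquet("/tmp/by_partition")
      .groupBy("id").agg(sum("col1"), avg("col2"))
      .explain()

    // Bucketing (Spark 2.0+): rows are hashed into a fixed number of buckets
    // at write time; bucketed output requires saveAsTable rather than a bare
    // path. The scan then reports a hash partitioning on "id" that the
    // aggregation can reuse, so its plan should contain no Exchange.
    df.write.mode("overwrite").bucketBy(8, "id").sortBy("id")
      .saveAsTable("by_bucket")
    spark.table("by_bucket")
      .groupBy("id").agg(sum("col1"), avg("col2"))
      .explain()

    spark.stop()
  }
}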