The shuffle is not necessary if you are using bucketing, which is available in
Spark 2.0. For partitioning, it is still necessary because we do not assume
each partition is small, so there is no guarantee that all the records for a
partition end up in a single Spark task partition.
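For illustration, here is a minimal Scala sketch of the bucketing write path
being referred to (the table name "events_bucketed", the column "id", and the
bucket count are just placeholders). Rows are hashed into a fixed number of
buckets at write time, so a later groupBy or join on the bucketing column can
reuse that layout instead of shuffling:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical dataset keyed by "id".
    val df = (1 to 1000000).map(i => (i % 1000, s"payload-$i")).toDF("id", "value")

    // Bucket the output by "id": rows with the same hash(id) land in the same
    // bucket file. Bucketed output goes through saveAsTable rather than a plain
    // path-based save.
    df.write
      .bucketBy(8, "id")
      .sortBy("id")
      .saveAsTable("events_bucketed")

    // An aggregation on the bucketing column can then reuse the layout.
    val counts = spark.table("events_bucketed").groupBy("id").count()
    counts.explain()  // the plan should not need an Exchange before the aggregation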


On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:

> Hi,
>
> If I had a df and I wrote it out via partitionBy("id"), presumably, when I
> load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
> right? Effectively, we can load in the dataframe with a hash partitioner
> already set, since each task can simply read all the folders where
> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
> optimization that is on the radar? This will be a huge boon in terms of
> reducing the number of shuffles necessary if we're always joining on the
> same columns.
>
> Best,
>
> Justin
>

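For comparison, here is a sketch of the directory-partitioned layout described
in the question (the output path and column name are placeholders). As the
reply above explains, reading this layout back does not install a hash
partitioner, so the groupBy still plans a shuffle:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitionBy-sketch").getOrCreate()
    import spark.implicits._

    val df = (1 to 1000000).map(i => (i % 1000, s"payload-$i")).toDF("id", "value")

    // Write one directory per id value: .../id=<value>/part-...
    df.write
      .partitionBy("id")
      .parquet("/tmp/events_by_id")

    // Reading the partitioned layout back, Spark does not assume each partition
    // is small or co-located, so grouping on "id" still requires an Exchange
    // (shuffle) in the plan.
    val reloaded = spark.read.parquet("/tmp/events_by_id")
    reloaded.groupBy("id").count().explain()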