Could you explain why this would work?
Assaf.

From: Haviv, Daniel [mailto:dha...@amazon.com]
Sent: Sunday, January 29, 2017 7:09 PM
To: Mendelson, Assaf
Cc: user@spark.apache.org
Subject: Re: forcing dataframe groupby partitioning

If there's no built-in local groupBy, you could do something like this (a fuller sketch follows below):
df.groupBy(C1, C2).agg(...).flatMap(x => x.groupBy(C1)).agg(...)

Thank you.
Daniel
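
Spelled out a bit, the "local groupBy" idea might look like the following mapPartitions sketch. This is only an illustration: the column names C1/C2/V, the toy DataFrame, the FirstAgg case class, and the sum aggregations are all assumptions, not from this thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    // Hypothetical schema for the output of the first aggregation.
    case class FirstAgg(c1: String, c2: String, v: Long)

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real df with columns C1, C2, V.
    val df = Seq(("a", "x", 1L), ("a", "y", 2L), ("b", "x", 3L)).toDF("C1", "C2", "V")

    // First aggregation: one shuffle, hash-partitioned by (C1, C2).
    val first = df.groupBy($"C1", $"C2").agg(sum($"V").as("v")).as[FirstAgg]

    // "Local" second aggregation: group within each partition, no further shuffle.
    // Note: this is only correct if all rows sharing a C1 value sit in the same
    // partition, which hashing by (C1, C2) does not by itself guarantee.
    val second = first.mapPartitions { rows =>
      rows.toSeq.groupBy(_.c1).iterator.map { case (c1, grp) => (c1, grp.map(_.v).sum) }
    }

The caveat in the last comment is presumably what the "why would this work?" question above is getting at.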

On 29 Jan 2017, at 18:33, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
Hi,

Consider the following example:

    df.groupBy(C1, C2).agg(some agg).groupBy(C1).agg(some more agg)

By default, Spark will shuffle by the combination of C1 and C2 and then shuffle 
again by C1 alone. This makes sense when C2 is used to salt C1 for skewed data; 
however, when the two-step aggregation is simply part of the logic, the cost of 
two shuffles instead of one is significant (and worse when multiple iterations 
are done).
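
The two exchanges are visible in the physical plan. A quick check (a sketch only; it assumes a sum over a numeric column V and that spark.implicits._ is in scope):

    import org.apache.spark.sql.functions.sum

    df.groupBy($"C1", $"C2").agg(sum($"V").as("v"))
      .groupBy($"C1").agg(sum($"v").as("total"))
      .explain()
    // The plan typically shows two shuffles, e.g.:
    //   Exchange hashpartitioning(C1, C2, 200)
    //   Exchange hashpartitioning(C1, 200)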

Is there a way to force the first groupBy to shuffle the data so that the 
second groupBy can run within the existing partitions, without a second shuffle?

Thanks,
Assaf.
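
For reference, one approach often suggested for this pattern (not proposed in this thread; the column names and sum aggregations are assumptions) is to repartition by C1 up front, so that one explicit shuffle can serve both aggregations:

    import org.apache.spark.sql.functions.sum

    val result = df
      .repartition($"C1")                             // one explicit shuffle by C1
      .groupBy($"C1", $"C2").agg(sum($"V").as("v"))   // C1 partitioning also clusters (C1, C2)
      .groupBy($"C1").agg(sum($"v").as("total"))      // still clustered by C1

    result.explain()  // check that only the repartition Exchange remains

Whether the optimizer actually elides the later exchanges can vary by Spark version, so verifying the plan with explain() is worthwhile.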
