Hi All,

I have a use case where I need to apply nested window functions to a
DataFrame to get the final set of columns. Example:

from pyspark.sql import functions as F, Window

w1 = Window.partitionBy('col1')
df = df.withColumn('sum1', F.sum('val').over(w1))

w2 = Window.partitionBy('col1', 'col2')
df = df.withColumn('sum2', F.sum('val').over(w2))

w3 = Window.partitionBy('col1', 'col2', 'col3')
df = df.withColumn('sum3', F.sum('val').over(w3))

These 3 partitions are not huge at all; however, the data size is 2 TB of
Snappy-compressed Parquet. This throws a lot of out-of-memory errors.

I would like some advice on whether nested window functions are a good
idea in PySpark. I wanted to avoid using multiple filter + join steps to
get to the final state (sketched below), as joins can create a huge
amount of shuffle.
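For reference, the join-based alternative I'm trying to avoid would look
roughly like this (a rough sketch using the same placeholder column names
as in the example above):

# Compute each aggregate separately and join it back -- every groupBy
# and every join adds its own shuffle over the 2 TB input.
sum1 = df.groupBy('col1').agg(F.sum('val').alias('sum1'))
sum2 = df.groupBy('col1', 'col2').agg(F.sum('val').alias('sum2'))
sum3 = df.groupBy('col1', 'col2', 'col3').agg(F.sum('val').alias('sum3'))

df = (df.join(sum1, on='col1')
        .join(sum2, on=['col1', 'col2'])
        .join(sum3, on=['col1', 'col2', 'col3']))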

Any suggestions would be appreciated!

-- 
Regards,

Rishi Shah
