Hi All,

I have a use case where I need to apply several window functions in sequence on a DataFrame to get a final set of columns. Example:
w1 = Window.partitionBy('col1')
df = df.withColumn('sum1', F.sum('val').over(w1))

w2 = Window.partitionBy('col1', 'col2')
df = df.withColumn('sum2', F.sum('val').over(w2))

w3 = Window.partitionBy('col1', 'col2', 'col3')
df = df.withColumn('sum3', F.sum('val').over(w3))

None of these 3 partitions is huge at all, but the underlying data is about 2 TB of snappy-compressed Parquet, and the job throws a lot of OutOfMemory errors. I would like some advice on whether stacking window functions like this is a good idea in PySpark. I wanted to avoid using multiple filter + join steps to get to the final state, since the joins can create a crazy amount of shuffle.

Any suggestions would be appreciated!

--
Regards,
Rishi Shah
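P.S. In case it helps anyone reproduce, here is a self-contained sketch of the pattern above. The column names (col1, col2, col3, val) come from the example; the toy rows and the SparkSession setup are just placeholders, since the real job reads the ~2 TB Parquet dataset.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-windows-example").getOrCreate()

# Toy stand-in for the real 2 TB Parquet source.
df = spark.createDataFrame(
    [("a", "x", 1, 10.0), ("a", "x", 2, 5.0), ("a", "y", 1, 7.0), ("b", "x", 1, 3.0)],
    ["col1", "col2", "col3", "val"],
)

# One window per grouping level; each sum is computed over a progressively
# finer partitioning of the same rows.
w1 = Window.partitionBy("col1")
w2 = Window.partitionBy("col1", "col2")
w3 = Window.partitionBy("col1", "col2", "col3")

df = (
    df.withColumn("sum1", F.sum("val").over(w1))
      .withColumn("sum2", F.sum("val").over(w2))
      .withColumn("sum3", F.sum("val").over(w3))
)

df.show()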