Hi Community, can someone please help validate the idea below and suggest its pros/cons?
Most of our jobs end up with a shuffle stage keyed on a partition column value before writing to parquet, and we usually have data skewness across those partitions. Currently most of the problems happen at the shuffle read stage, where we face issues like:

1. Executor lost
2. Node lost
3. Shuffle fetch errors

*I have been thinking about ways to completely avoid deserializing data during the shuffle read phase, and one way to do it in our case would be to:*

1. *Serialize the shuffle write in parquet + zstd format.*
2. *Simply move the data files from the shuffle blocks written locally on the executors into the partition folders (this avoids deserializing the data into memory and disk and then writing it out as parquet again).*

Please confirm the feasibility here and any pros/cons of the above approach (a rough sketch of our current write pattern is in the PS below).

Regards.
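PS: to make the setup concrete, here is a minimal sketch of the kind of job described above (the column name, paths, and app name are placeholders, not our actual code). The `repartition` on the partition column is what introduces the shuffle whose read side fails for us under skew:

```scala
import org.apache.spark.sql.SparkSession

object PartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-write")   // placeholder app name
      .getOrCreate()

    // Placeholder input; in our jobs this is the output of earlier transformations.
    val df = spark.read.parquet("/data/input")

    df.repartition(df("event_date"))      // shuffle keyed on the (skewed) partition column
      .write
      .mode("overwrite")
      .option("compression", "zstd")      // parquet + zstd, as in the proposal
      .partitionBy("event_date")          // placeholder partition column
      .parquet("/data/output")

    spark.stop()
  }
}
```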