Hi, ... Most of our jobs end up with a shuffle stage based on a partition column value before writing into Parquet, and most of the time we have data skewness in the partitions....
Have you considered the causes of these recurring issues and some potential alternative strategies?

1. Tuning Spark configuration related to shuffle operations: settings such as spark.sql.shuffle.partitions, spark.reducer.maxSizeInFlight, spark.shuffle.memoryFraction, spark.shuffle.spill, etc.

2. Partitioning strategy: it may help to review and optimise the partitioning strategy to minimise data skewness. Start by looking at the causes of the skew, for example:

   SELECT column_name, COUNT(column_name) AS count
   FROM ABC
   GROUP BY column_name
   ORDER BY count DESC

   Then you can try techniques like salting or bucketing to distribute the data more evenly.

3. Caching frequently accessed data: if certain data is frequently accessed, you may consider caching it in memory to reduce the need for repeated shuffling.

The feasibility of your proposal depends on the specific requirements, the characteristics of your data, and the downstream processes that consume that data. If downstream tools or processes expect data in a specific format, the serialized format may require additional processing or conversion, affecting compatibility.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On Wed, 7 Feb 2024 at 18:59, satyajit vegesna <satyajit.apas...@gmail.com> wrote:

> Hi Community,
>
> Can someone please help validate the idea below and suggest pros/cons.
>
> Most of our jobs end up with a shuffle stage based on a partition column
> value before writing into Parquet, and most of the time we have data
> skewness in the partitions.
>
> Currently most of the problems happen at the shuffle read stage, and we
> face several issues like the below:
>
> 1. Executor lost
> 2. Node lost
> 3. Shuffle fetch errors
>
> And I have been thinking about ways to completely avoid de-serializing
> data during the shuffle read phase, and one way to do it in our case
> would be to:
>
> 1. Serialize the shuffle write in Parquet + zstd format
> 2. Just move the data files into partition folders from the shuffle
> blocks written locally on the executors (this avoids de-serializing the
> data into memory and disk and then writing into Parquet)
>
> Please confirm the feasibility here and any pros/cons of the above
> approach.
>
> Regards.
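P.S. To make the configuration suggestion above concrete, the shuffle-related settings can be passed per job via spark-submit. This is only a sketch: the numeric values are illustrative placeholders to be tuned against your own workload, and your_job.py is a hypothetical script name.

```shell
# Illustrative spark-submit with the shuffle-related settings mentioned above.
# The numbers are examples, not recommendations; tune them for your data volume.
spark-submit \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.reducer.maxSizeInFlight=96m \
  your_job.py
```

Raising spark.sql.shuffle.partitions spreads shuffle data over more, smaller tasks, which often helps with fetch failures on skewed stages; spark.reducer.maxSizeInFlight bounds how much shuffle data each reducer requests at once.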