Hi,

... Most of our jobs end up with a shuffle stage based on a partition
column value before writing into a parquet, and most of the time we have
data skewness in partitions....

Have you considered the causes of these recurring issues and some potential
alternative strategies?

   1. Tuning the Spark configuration related to shuffle operations:
   settings like spark.sql.shuffle.partitions, spark.reducer.maxSizeInFlight,
   spark.shuffle.memoryFraction and spark.shuffle.spill (note the last two
   are legacy settings in recent releases). A short sketch follows after
   this list.

   2. Partitioning strategy: it may be worth reviewing and optimizing the
   partitioning strategy to minimize data skewness, starting with finding
   out which key values cause it:

   SELECT column_name, COUNT(column_name) AS count
   FROM ABC
   GROUP BY column_name
   ORDER BY count DESC

   Then you can try things like salting or bucketing to distribute the
   data more evenly (see the salting sketch below).

   3. Caching frequently accessed data: if certain data is frequently
   accessed, you may consider caching it in memory to reduce the need for
   repeated shuffling (a brief sketch below as well).
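
On point 1, a minimal PySpark sketch of setting a couple of those shuffle
properties when building the session (the values are placeholders, not
recommendations, and need tuning against your own data volumes and cluster):

from pyspark.sql import SparkSession

# Placeholder values only; tune for your own workload.
spark = (
    SparkSession.builder
    .appName("shuffle-tuning-sketch")
    .config("spark.sql.shuffle.partitions", "800")    # default is 200
    .config("spark.reducer.maxSizeInFlight", "96m")   # default is 48m
    .getOrCreate()
)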
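
For the salting, something along these lines (a rough sketch only; the toy
DataFrame, column names, salt count and output path are all placeholders,
and the zstd option simply mirrors the parquet + zstd you mention):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# Toy skewed data; in practice this is your real DataFrame and its skewed
# partition column.
df = spark.createDataFrame(
    [("hot", i) for i in range(1000)] + [("cold", 1)],
    ["partition_col", "value"],
)

NUM_SALTS = 16  # hypothetical; roughly how many files per hot value you can tolerate

salted = df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

(salted
 .repartition("partition_col", "salt")   # shuffle on (value, salt) so hot values spread over tasks
 .drop("salt")                           # the salt only drives the shuffle, it is not written out
 .write
 .option("compression", "zstd")
 .partitionBy("partition_col")
 .mode("overwrite")
 .parquet("/tmp/salting_sketch"))        # hypothetical output path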
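
And for the caching point, reusing the salted DataFrame from the sketch
above (again purely illustrative):

from pyspark import StorageLevel

# If the same prepared DataFrame feeds several writes or aggregations,
# persist it once instead of recomputing its shuffle every time.
prepared = salted.persist(StorageLevel.MEMORY_AND_DISK)
prepared.count()      # materialize the cache once
# run the downstream writes/aggregations against prepared here
prepared.unpersist()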

The feasibility of your proposal depends on the specific requirements, the
characteristics of your data, and the downstream processes that consume
that data. If downstream tools or processes expect data in a specific
format or layout, the serialized shuffle output you relocate into partition
folders may require additional processing or conversion, impacting
compatibility.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Feb 2024 at 18:59, satyajit vegesna <satyajit.apas...@gmail.com>
wrote:

> Hi Community,
>
> Can someone please help validate the idea below and suggest pros/cons.
>
> Most of our jobs end up with a shuffle stage based on a partition column
> value before writing into parquet, and most of the time we have data
> skewness in partitions.
>
> Currently most of the problems happen at shuffle read stage and we face
> several issues like below,
>
>    1. Executor lost
>    2. Node lost
>    3. Shuffle Fetch errors
>
> *And I have been thinking about ways to completely avoid de-serializing
> data during shuffle read phase and one way to be able to do it in our case
> is by,*
>
>    1. *Serialize the shuffle write in parquet + zstd format*
>    2. *Just move the data files into partition folders from shuffle
>    blocks locally written to executors  (This avoids trying to de-serialize
>    the data into memory and disk and then write into parquet)*
>
> Please confirm on the feasibility here and any pros/cons on the above
> approach.
>
> Regards.
>
