Yes, the Spark UI's "Shuffle Write Size/Records" column can sometimes show incorrect record counts when data is retrieved from cached or persisted data. This happens because the record count reflects the number of records written to disk for shuffling, not the actual number of records in the cached or persisted data itself.

Additionally, because of lazy evaluation, Spark may only materialize a portion of the cached or persisted data when a task needs it. In that case "Shuffle Write Size/Records" may reflect only the materialized portion, not the total number of records in the cache.

While "Shuffle Write Size/Records" can be inaccurate for cached/persisted data, the "Shuffle Read Size/Records" column is usually more reliable. It shows the number of records read from the shuffle by the following stage, which should be closer to the actual number of records processed.
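If you want to sanity-check what the UI reports, an explicit count() is the ground truth. Below is a minimal, self-contained sketch (not your job; it assumes a local session and synthetic data) that fully materializes a cache, triggers a shuffle, and prints counts you can compare against the "Shuffle Write Size/Records" value shown for the exchange stage in the Stages tab:

import org.apache.spark.sql.SparkSession

object ShuffleMetricsCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShuffleMetricsCheck")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Synthetic data: 10M rows, bucketed into 100 groups
    val df = spark.range(0L, 10000000L).toDF("id")
      .withColumn("bucket", $"id" % 100)

    // Cache and force full materialization with an action,
    // so later stages read the whole cached dataset
    val cached = df.cache()
    println(s"rows in cache:        ${cached.count()}")

    // groupBy introduces an exchange (shuffle); compare the
    // counts printed here with the Shuffle Write/Read metrics
    // reported for this stage in the Spark UI
    val aggregated = cached.groupBy("bucket").count()
    println(s"groups after shuffle: ${aggregated.count()}")

    spark.stop()
  }
}

Running an action before the shuffle (the first count() above) matters: without it, only the partitions a task touches get materialized, which is exactly the situation where the write-side metric can diverge from the true record count.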
HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London, United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Thu, 23 May 2024 at 17:45, Prem Sahoo <prem.re...@gmail.com> wrote:

> Hello Team,
> In the Spark DAG UI we have a Stages tab. Once you click on a stage you
> can view its tasks.
>
> Each task has a column "Shuffle Write Size/Records". That column prints
> wrong data when it gets the data from cache/persist. It typically shows
> the wrong record number even though the data size is correct, e.g.
> 3.2G / 7400, which is wrong.
>
> Please advise.