[GitHub] [spark] Deependra-Patel commented on pull request #42071: [SPARK-44209] Expose amount of shuffle data available on the node

via GitHub Tue, 25 Jul 2023 03:17:55 -0700


Deependra-Patel commented on PR #42071:
URL: https://github.com/apache/spark/pull/42071#issuecomment-1649543098


   > This is going to be very expensive to compute, and has nontrivial 
performance impact (particularly on a service which 
   tends to be already loaded and critical).
   
   We have been running 1 TB TPC-H/DS benchmarks daily with these changes and 
haven't seen any difference in performance. We only list the files and check 
their lengths; we don't read the data itself, so it shouldn't be that 
expensive. Also, we only do it every 30s.
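
A rough sketch of the idea, not the code in this PR: the function and class names below are hypothetical, and the directory layout is an assumption. It sums file lengths from filesystem metadata only, never opens a file, and recomputes the total at most once per refresh interval (30s by default, matching the interval mentioned above).

```python
import os
import time


def shuffle_bytes_on_disk(local_dirs):
    """Sum file sizes under each directory using metadata only (no file reads)."""
    total = 0
    for root in local_dirs:
        # os.walk yields nothing for a directory that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    total += os.stat(os.path.join(dirpath, name)).st_size
                except FileNotFoundError:
                    # A file can be deleted while we scan; skip it.
                    pass
    return total


class CachedShuffleSizeGauge:
    """Hypothetical gauge that rescans at most once per refresh interval."""

    def __init__(self, local_dirs, refresh_seconds=30.0, clock=time.monotonic):
        self._local_dirs = list(local_dirs)
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._last_refresh = None
        self._value = 0

    def value(self):
        now = self._clock()
        stale = (
            self._last_refresh is None
            or now - self._last_refresh >= self._refresh_seconds
        )
        if stale:
            self._value = shuffle_bytes_on_disk(self._local_dirs)
            self._last_refresh = now
        return self._value
```

With this shape, reading the gauge between refreshes returns the cached total, so the cost of the directory listing is paid once per interval regardless of how often the metric is scraped.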
   
   
   > Exposing this from executors, as part of data gen and aggregating per node 
would be much more cheaper
   
   Executors can be stopped when dynamic allocation is used, and then we won't 
be able to read this metric from the executor for aggregation. We would also 
be left with a stale metric if the ESS polls executors periodically and an 
executor goes away after the last poll. The case where there is no shuffle 
data on the node, and this metric reads 0, is the important one, because that 
is what allows us to actually remove the node.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

