Deependra-Patel commented on PR #42071: URL: https://github.com/apache/spark/pull/42071#issuecomment-1649543098
> This is going to be very expensive to compute, and has nontrivial performance impact (particularly on a service which tends to be already loaded and critical).

We have been running 1 TB TPC-H/DS benchmarks daily with these changes and haven't seen any difference in performance. We only list the files and check their lengths; we do not read the data itself, so it shouldn't be that expensive. Also, we do it every 30s.

> Exposing this from executors, as part of data gen and aggregating per node would be much more cheaper

Executors can be stopped when dynamic allocation is used, and then we won't be able to read this metric from the executor for aggregation. We would also have a stale metric if the ESS polls regularly before the executor goes away. The scenario where there is no shuffle data on the node and this metric reports 0 is important for actually removing the node.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
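
For readers following the thread, the listing-only approach described in the comment (sum file lengths from metadata, never open the files, refresh every 30s) might look roughly like the sketch below. This is not the code from the PR: the class `ShuffleDiskUsage`, the methods `totalBytes` and `startPolling`, and the use of an `AtomicLong` as the gauge are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not the implementation in the PR.
final class ShuffleDiskUsage {

    // Sums the sizes of all regular files under root using file metadata only.
    // No file contents are read. A missing directory counts as 0 bytes.
    static long totalBytes(Path root) throws IOException {
        if (!Files.isDirectory(root)) {
            return 0L;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // The file may have been deleted between listing and stat.
                        return 0L;
                    }
                })
                .sum();
        } catch (UncheckedIOException e) {
            // Files.walk reports traversal errors as UncheckedIOException.
            throw e.getCause();
        }
    }

    // Refreshes the gauge on a fixed delay (the comment mentions 30 seconds).
    static ScheduledExecutorService startPolling(Path root, AtomicLong gauge, long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "shuffle-disk-usage");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                gauge.set(totalBytes(root));
            } catch (Exception e) {
                // Keep the previous value; an uncaught exception would cancel the schedule.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

With this shape, an empty or missing shuffle directory yields 0, which matches the case the comment calls out as the signal that a node can be removed. The cost is one metadata lookup per file per polling interval, so it scales with the number of shuffle files rather than with their size.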
