JoshRosen opened a new pull request, #39770: URL: https://github.com/apache/spark/pull/39770
### What changes were proposed in this pull request?

This change updates `JsonProtocol` to exclude the "Task Executor Metrics" field from SparkListenerTaskEnd events when all metric values are zero.

### Why are the changes needed?

This saves space in event logs when Spark runs under its default out-of-the-box configuration and tasks are shorter than the executor heartbeat interval.

[SPARK-26329](https://issues.apache.org/jira/browse/SPARK-26329) added "Task Executor Metrics" to the SparkListenerTaskEnd JSON written by JsonProtocol. With the default `spark.executor.metrics.pollingInterval = 0` configuration, these metric values are only updated when heartbeats occur. If a task launches and finishes between executor heartbeats, then all of the "Task Executor Metrics" values will be zero. For jobs with large numbers of short tasks, this contributes to significant event log bloat.

JsonProtocol already knows how to handle the absence of the "Task Executor Metrics" field, so I think it's safe for us to omit this field when all values are zero.

There is a possibility that third-party code which directly consumes Spark event logs might be relying on the presence of this field. As an "escape-hatch" to avoid breaking such workloads, I have introduced a `spark.eventLog.includeAllZeroTaskExecutorMetrics` configuration (default `false`) which can be set to `true` to restore the old behavior.

### Does this PR introduce _any_ user-facing change?

No user-facing changes in the history server.

This could be considered a user-facing change from the perspective of third-party code which does its own processing of Spark event logs, hence the config. I think it's reasonable to set a sensible default which shrinks event logs for most users instead of keeping a conservative default to support a hypothetical third-party use case of our event logs.

### How was this patch tested?

Added new test cases in JsonProtocolSuite.
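For readers who process event logs themselves, the behavior described above can be illustrated with a small Python model. This is only a sketch and is not the `JsonProtocol` code from this PR: the function names, the simplified event shape, and the `include_all_zero` parameter (standing in for `spark.eventLog.includeAllZeroTaskExecutorMetrics`) are all hypothetical. The reader side defaults a missing metric to zero, which is an assumption of this sketch rather than a statement about what Spark does internally.

```python
import json

# Hypothetical model of the described behavior; not Spark's JsonProtocol code.
METRICS_FIELD = "Task Executor Metrics"


def task_end_to_json(task_id, executor_metrics, include_all_zero=False):
    """Serialize a simplified task-end event.

    The metrics field is written only if at least one value is non-zero,
    or if include_all_zero is set (modeling the escape-hatch config).
    """
    event = {"Event": "SparkListenerTaskEnd", "Task ID": task_id}
    all_zero = all(value == 0 for value in executor_metrics.values())
    if include_all_zero or not all_zero:
        event[METRICS_FIELD] = executor_metrics
    return json.dumps(event)


def read_executor_metrics(line, known_metrics):
    """Parse one event line, tolerating a missing metrics field.

    Assumption of this sketch: an absent metric is reported as 0.
    """
    event = json.loads(line)
    metrics = event.get(METRICS_FIELD, {})
    return {name: metrics.get(name, 0) for name in known_metrics}


if __name__ == "__main__":
    zeros = {"JVMHeapMemory": 0, "JVMOffHeapMemory": 0}
    print(task_end_to_json(1, zeros))                         # field omitted
    print(task_end_to_json(1, zeros, include_all_zero=True))  # field kept
```

A third-party consumer written like `read_executor_metrics`, using a lookup with a default rather than assuming the key is present, would not need the escape-hatch config at all.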
