JoshRosen opened a new pull request, #39767: URL: https://github.com/apache/spark/pull/39767
### What changes were proposed in this pull request? This PR modifies JsonProtocol in order to skip logging of accumulator values in the logs for SparkListenerTaskStart and SparkListenerStageSubmitted events. The SparkListenerTaskStart and SparkListenerStageSubmitted events contain mutable TaskInfo and StageInfo objects, which in turn contain Accumulables fields. When a task or stage is submitted, Accumulables is initially empty. When the task or stage finishes, this field is updated with values from the task. If a task or stage finishes _before_ the start event has been logged by the event logging listener then the start event will contain the Accumulable values from the task or stage end event. This PR updates JsonProtocol to log an empty Accumulables value for stage and task start events. I considered and rejected an alternative approach where the listener event itself would contain an immutable snapshot of the TaskInfo or StageInfo, as this will increase memory pressure on the driver during periods of heavy event logging. Those accumulables values in the start events are not used: I confirmed this by checking AppStatusListener and SQLAppStatusListener code. I have deliberately chosen to **not** drop the field for _job_ start events because it is technically possible (but rare) for a job to reference stages that are completed at the time that the job is submitted (a state can technically belong to multiple jobs) and in that case it seems consistent to have the StageInfo accurately reflect all of the information about the already-completed stage. ### Why are the changes needed? This information isn't used by the History Server and contributes to wasteful bloat in event log sizes. In one real-world log, I found that ~10% of the uncompressed log size was due to these redundant Accumulable fields in stage and task start events. I don't think that we need to worry about backwards-compatibility here because the old behavior was non-deterministic: whether or not a start event log contained accumulator updates was a function of the relative speed of task completion and the processing rate of the event logging listener; it seems unlikely that any third-party event log consumers would be relying on such an inconsistently present value when they could instead rely on the values in the corresponding end events. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New and updated tests in JsonProtocolSuite. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
