JoshRosen opened a new pull request, #39767:
URL: https://github.com/apache/spark/pull/39767

   ### What changes were proposed in this pull request?
   
   This PR modifies JsonProtocol in order to skip logging of accumulator values 
in the logs for SparkListenerTaskStart and SparkListenerStageSubmitted events.
   
   The SparkListenerTaskStart and SparkListenerStageSubmitted events contain 
mutable TaskInfo and StageInfo objects, which in turn contain Accumulables 
fields. When a task or stage is submitted, Accumulables is initially empty. 
When the task or stage finishes, this field is updated with values from the 
task.
   
   If a task or stage finishes _before_ the start event has been logged by the 
event logging listener then the start event will contain the Accumulable values 
from the task or stage end event.  
   
   This PR updates JsonProtocol to log an empty Accumulables value for stage 
and task start events.
   
   I considered and rejected an alternative approach where the listener event 
itself would contain an immutable snapshot of the TaskInfo or StageInfo, as 
this will increase memory pressure on the driver during periods of heavy event 
logging.
   
   Those accumulables values in the start events are not used: I confirmed this 
by checking AppStatusListener and SQLAppStatusListener code.
   
   I have deliberately chosen to **not** drop the field for _job_ start events 
because it is technically possible (but rare) for a job to reference stages 
that are completed at the time that the job is submitted (a state can 
technically belong to multiple jobs) and in that case it seems consistent to 
have the StageInfo accurately reflect all of the information about the 
already-completed stage.
   
   ### Why are the changes needed?
   
   This information isn't used by the History Server and contributes to 
wasteful bloat in event log sizes. In one real-world log, I found that ~10% of 
the uncompressed log size was due to these redundant Accumulable fields in 
stage and task start events.
   
   I don't think that we need to worry about backwards-compatibility here 
because the old behavior was non-deterministic: whether or not a start event 
log contained accumulator updates was a function of the relative speed of task 
completion and the processing rate of the event logging listener; it seems 
unlikely that any third-party event log consumers would be relying on such an 
inconsistently present value when they could instead rely on the values in the 
corresponding end events.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   New and updated tests in JsonProtocolSuite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to