HeartSaVioR commented on issue #26416: [WIP][SPARK-29779][CORE] Compact old 
event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#issuecomment-552162424
 
 
   > And IIUC, in this way, assuming we have a stage with plenty of tasks (e.g. 100000), EventLoggingListener would write them into multiple rolled event files. Then, in FsHistoryProvider, when RollingEventLogFilesFileReader reads those rolled event files (assume the stage is still running), we may compact all of those event files (assume most/all of them should be compacted) into a single compacted event file, because the filters couldn't drop those events - mostly task-related events, e.g. SparkListenerTaskStart, SparkListenerTaskEnd. And this would result in a still-huge compacted event file. Is that right?
   
   So the target workload matters here. We can't make everyone happy - the major target of this feature is streaming queries. Given that goal, in most cases only one job is live at a time (a SQL execution may log a couple of jobs, but it's still just one batch), and the job wouldn't be super complicated. Latency is a first-class concern for streaming workloads - a couple of seconds per batch, or even 10+ seconds, may be acceptable, but a streaming query where a batch takes minutes wouldn't be welcome - it gets less and less valuable as latency grows. IMHO, assuming even 10000s of tasks per batch in a streaming query is more of a hypothetical case.
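
   To make the trade-off concrete, here is a minimal, hypothetical sketch of the filtering logic being discussed - it is not the EventFilter code in this PR, and the class and method names are illustrative only. It shows why task-level events for a live stage survive compaction: they can only be dropped once their stage has completed, so a single live stage with 100000 tasks keeps all of its SparkListenerTaskStart/TaskEnd events in the compacted file.

   ```scala
   import org.apache.spark.scheduler._
   import scala.collection.mutable

   // Illustrative sketch only (assumes spark-core on the classpath).
   // Tracks which stages are still live and retains their task-level events.
   class LiveStageEventFilter {
     private val liveStages = mutable.Set[Int]()

     // Feed every event through this to keep the live-stage set up to date.
     def update(event: SparkListenerEvent): Unit = event match {
       case e: SparkListenerStageSubmitted => liveStages += e.stageInfo.stageId
       case e: SparkListenerStageCompleted => liveStages -= e.stageInfo.stageId
       case _ =>
     }

     // Returns true if the event must be retained in the compacted file.
     // Task events of a live stage can never be dropped, which is why a
     // single huge live stage still produces a huge compacted file.
     def accept(event: SparkListenerEvent): Boolean = event match {
       case e: SparkListenerTaskStart => liveStages.contains(e.stageId)
       case e: SparkListenerTaskEnd   => liveStages.contains(e.stageId)
       case _                         => true // keep non-task-level events
     }
   }
   ```

   For the streaming workloads this feature targets, batches (and hence stages) complete within seconds, so nearly all task events become droppable and compaction stays effective.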
