HeartSaVioR commented on issue #26416: [WIP][SPARK-29779][CORE] Compact old event log files and cleanup URL: https://github.com/apache/spark/pull/26416#issuecomment-552162424

> And IIUC, in this way, assuming that we have a stage with plenty of tasks (e.g. 100000), EventLoggingListener would write them into multiple rolled event files. Then, in FsHistoryProvider, when we use RollingEventLogFilesFileReader to read those rolled event files (assume the stage is still running), we may compact all those event files (assume most/all of them should be compacted) into a single compacted event file, because the filters couldn't drop those events, mostly task-related events, e.g. SparkListenerTaskStartEvent, SparkListenerTaskEndEvent. And this would result in a still-huge compacted event file. Is that right?

So the target workload matters. We can't make everyone happy - the major target of the feature is streaming queries. Given that goal, in most cases there will be only one job live at a time (a SQL execution may log a couple of jobs, but it's still just one batch), and the job wouldn't be super complicated.

Latency is a first-class concern for streaming workloads - we may be OK with a couple of seconds per batch, or even 10+ seconds, but a streaming query where a batch takes minutes wouldn't be welcomed - it becomes less and less valuable as latency grows. IMHO it's more of a hypothetical case to assume even 10000s of tasks in a streaming query.
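For reference, the behavior described in the quoted question can be sketched as a simple two-pass filter over listener events. This is a minimal illustration, not the actual EventFilter implementation in this PR: `LiveStageTaskEventFilter`, `completedStageIds`, `keep`, and `compact` are hypothetical names, and only the standard `org.apache.spark.scheduler` listener event classes are assumed.

```scala
// Hypothetical sketch: task-level events can only be dropped once their owning
// stage has completed. For a still-live stage with e.g. 100000 tasks, every
// SparkListenerTaskStart/TaskEnd must be kept, so the compacted file can still
// be large. Not the EventFilter API added in this PR.
import org.apache.spark.scheduler._

object LiveStageTaskEventFilter {

  /** Pass 1: stage ids for which a SparkListenerStageCompleted was logged. */
  def completedStageIds(events: Seq[SparkListenerEvent]): Set[Int] =
    events.collect { case e: SparkListenerStageCompleted => e.stageInfo.stageId }.toSet

  /** Pass 2: keep an event unless it is a task event of an already-completed stage. */
  def keep(event: SparkListenerEvent, completed: Set[Int]): Boolean = event match {
    case e: SparkListenerTaskStart => !completed.contains(e.stageId)
    case e: SparkListenerTaskEnd   => !completed.contains(e.stageId)
    case _                         => true // everything else is kept in this simplified sketch
  }

  /** Compaction over an in-memory event sequence, for illustration only. */
  def compact(events: Seq[SparkListenerEvent]): Seq[SparkListenerEvent] = {
    val completed = completedStageIds(events)
    events.filter(keep(_, completed))
  }
}
```

In that sketch, a stage that is still running never appears in the completed set, so all of its task start/end events pass through the filter - which is exactly the still-huge-compacted-file case raised above.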
