vanzin commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r358456155
##########
File path:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
##########
@@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(
Review comment:
Haven't looked at the updated code yet, but:

> And how many files/lines/bytes we should read to decide whether the app
> doesn't need to be analyzed further (to even skip reading first phase read)?

Rather than framing it that way, I think you need to work out what state to
maintain between scans (so you don't have to re-scan the same log files over
and over), and what you can get rid of. For example, the job/stage/task
retention configs can make a difference here.
But agree, do that in a separate change. This one is already pretty large as
is.
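
To make the "state between scans" idea a bit more concrete, here is a rough sketch of what I have in mind. Every name in it (`ScanState`, `IncrementalScan`, `indicesToScan`) is hypothetical and not part of this PR or of `FsHistoryProvider`; it also assumes that a stored decision is final, the same way `info.compactible.orElse` treats it in the diff above.

```scala
// Hypothetical sketch: none of these names exist in FsHistoryProvider.
// Per-application state kept between scans, so that event log files which
// were already analyzed are not read again.
final case class ScanState(
    lastScannedIndex: Long,       // highest event log file index already analyzed
    compactible: Option[Boolean]) // None = undecided, Some(x) = already decided

object IncrementalScan {
  /**
   * Returns the file indices that still need to be scanned, given the state
   * saved by a previous scan (if any) and the indices currently on disk.
   */
  def indicesToScan(prev: Option[ScanState], current: Seq[Long]): Seq[Long] =
    prev match {
      // A decision was already stored: nothing more to read (assumption).
      case Some(ScanState(_, Some(_))) => Seq.empty
      // Still undecided: only read files newer than the last one analyzed.
      case Some(ScanState(last, None)) => current.filter(_ > last).sorted
      // Never scanned before: read everything.
      case None => current.sorted
    }
}
```

Something like this could be stored next to `LogInfo` in the listing. What else it would need to hold (for instance, enough information to honor the job/stage/task retention configs) is the open question.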