vanzin commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r357442623
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(
 
 Review comment:
   Ok, here's where you try to determine whether compaction would be helpful. But you're basing it only on the first log file.
   
   What if a new log file completes after the first one and a bunch of jobs and stages finish in it, so that compaction would now actually help a lot?
   
   It seems a little short-sighted to base this on the first file alone.
   
   And if you're going to scan more than one file (so you capture the cases I just mentioned), why not make the listener smarter? You may not get exactly the same result, but you'd have a good idea of how many events are being replayed and how many you'd filter out; e.g. when a job finishes, you'd filter out all the stage and task events related to that job. That should give you a metric similar to this one, right? Without the need to run this calculator separately from the compaction itself?
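   
   For illustration, here's a minimal sketch of what such a listener could track. `FilterRateListener` and all of its bookkeeping are hypothetical, not part of this PR, and a real version would also count event types beyond the three handled here:
   
   ```scala
   import scala.collection.mutable
   
   import org.apache.spark.scheduler._
   
   class FilterRateListener extends SparkListener {
     private var totalEvents = 0L
     private var filterableEvents = 0L
     // Stage/task events seen so far, keyed by the job they belong to.
     private val liveEventsPerJob = new mutable.HashMap[Int, Long]()
     private val jobForStage = new mutable.HashMap[Int, Int]()
   
     override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
       totalEvents += 1
       jobStart.stageIds.foreach { stageId => jobForStage(stageId) = jobStart.jobId }
     }
   
     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
       totalEvents += 1
       jobForStage.get(taskEnd.stageId).foreach { jobId =>
         liveEventsPerJob(jobId) = liveEventsPerJob.getOrElse(jobId, 0L) + 1
       }
     }
   
     override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
       totalEvents += 1
       // Once the job is done, its stage/task events would be filtered out
       // by compaction, so count them as filterable.
       filterableEvents += liveEventsPerJob.remove(jobEnd.jobId).getOrElse(0L)
     }
   
     def filterRate: Double =
       if (totalEvents == 0) 0.0 else filterableEvents.toDouble / totalEvents
   }
   ```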
   
   Then you just do one pass to feed the listeners, and if the listeners say there are enough filterable events to warrant it, you compact the log files.
   
   (The downside is that as the number of log files grows, this would get more expensive, so maybe set a limit: "If things don't look good after X log files, just give up on compaction.")
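   
   To make that concrete, here's a hypothetical driver for the single pass with the file cap applied. `shouldCompact`, `replay`, `maxFilesToCheck` and the 0.5 cutoff are all made up for illustration; only `EventLogFileReader.listEventLogFiles` comes from the PR itself:
   
   ```scala
   import org.apache.hadoop.fs.Path
   
   import org.apache.spark.scheduler.SparkListener
   
   // `replay` stands in for whatever feeds one event log file into a listener.
   def shouldCompact(
       reader: EventLogFileReader,
       replay: (Path, SparkListener) => Unit): Boolean = {
     val maxFilesToCheck = 10  // the "give up after X files" cap
     val listener = new FilterRateListener
     reader.listEventLogFiles.take(maxFilesToCheck).foreach { status =>
       replay(status.getPath, listener)
     }
     // Compact only when enough of the replayed events would be dropped.
     listener.filterRate > 0.5
   }
   ```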
