vanzin commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r358456155
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(
 
 Review comment:
   Haven't looked at the updated code yet, but:
   
   > And how many files/lines/bytes we should read to decide whether the app
   > doesn't need to be analyzed further (to even skip reading first phase read)?
   
   I think that instead of framing it that way, you'll need to think about what
   kind of state you need to maintain between scans (so you don't have to re-scan
   the same log files over and over), and what kinds of things you can get rid of.
   For example, the job/stage/task retention configs can make a difference here.
   
   But agree, do that in a separate change. This one is already pretty large as 
is.
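
A minimal sketch of what "state between scans" could look like, assuming hypothetical names (`ScanState`, `ScanTracker`) that do not exist in Spark. It only remembers the highest event log file index already scanned per application log, so a later scan reads just the files added since; the PR itself persists `LogInfo` through `listing`, which this sketch does not model.

```scala
import scala.collection.mutable

// Hypothetical state kept per application log between scans (not a Spark class).
final case class ScanState(lastScannedIndex: Long, compactible: Option[Boolean])

// Hypothetical in-memory tracker; a real implementation would persist this state.
final class ScanTracker {
  private val states = mutable.Map.empty[String, ScanState]

  /** Indices of event log files that have not been scanned yet, in ascending order. */
  def pendingIndices(logPath: String, availableIndices: Seq[Long]): Seq[Long] = {
    val lastScanned = states.get(logPath).map(_.lastScannedIndex).getOrElse(-1L)
    availableIndices.filter(_ > lastScanned).sorted
  }

  /** Records how far the scan got and what it concluded about compaction. */
  def record(logPath: String, scannedUpTo: Long, compactible: Option[Boolean]): Unit = {
    states.update(logPath, ScanState(scannedUpTo, compactible))
  }

  /** Last recorded compaction decision, if the log has been scanned before. */
  def decision(logPath: String): Option[Boolean] =
    states.get(logPath).flatMap(_.compactible)
}
```

With this shape, a scan calls `pendingIndices` first, reads only those files, and then calls `record`. What else belongs in the state (for example, anything derived from the retention configs) is left open here.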

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

