vanzin commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r357828921
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(
 
 Review comment:
   > we intuitively know that compaction would help only streaming query in 
most cases
   
   If you're going to restrict yourself to that assumption, then why bother 
with such a complicated approach?
   
   You can avoid all this and just toss all the old log files. At most, you 
can create a "compact file" in one pass by keeping app start, env update, and 
executor start / end events; everything else is uninteresting if all you're 
interested in is keeping the latest jobs from a streaming query.
   
   That is much, much simpler and more efficient than what you have.
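   
   A minimal sketch of that one-pass idea, for illustration only. It assumes the event log is a sequence of JSON lines whose "Event" field names the listener event, and that "app start, env update, and executor start / end" map to the four event names below. `OnePassCompactSketch`, `keptEvents`, `eventName` and `compact` are hypothetical names, not part of this PR, and the regex stands in for a real JSON parser.

```scala
object OnePassCompactSketch {
  // Assumed mapping of "app start, env update, executor start / end"
  // to event names as they appear in the "Event" field of each log line.
  val keptEvents: Set[String] = Set(
    "SparkListenerApplicationStart",
    "SparkListenerEnvironmentUpdate",
    "SparkListenerExecutorAdded",
    "SparkListenerExecutorRemoved")

  // Naive extraction of the "Event" field; a real implementation
  // would parse the JSON instead of pattern-matching on the text.
  private val EventField = "\"Event\"\\s*:\\s*\"([^\"]+)\"".r

  def eventName(line: String): Option[String] =
    EventField.findFirstMatchIn(line).map(_.group(1))

  // Single pass over the old log lines: keep only the events above,
  // preserving their original order. Lines without an "Event" field are dropped.
  def compact(lines: Iterator[String]): Iterator[String] =
    lines.filter(line => eventName(line).exists(keptEvents.contains))
}
```

   The result of `compact` over the old files would be written out as the single "compact file", after which the old files can be deleted. Nothing here tracks live jobs or stages, which is the point of the simpler approach.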
   
   But if you're going the complicated route, I think it makes sense to think a 
little past that one use case. For example, a JDBC server can benefit from this. 
So can a long-running "shell" session (whether spark-shell or something like 
Zeppelin). So can servers like Livy.

