HeartSaVioR commented on a change in pull request #26416: [SPARK-29779][CORE] 
Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r357459465
 
 

 ##########
 File path: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(
 
 Review comment:
   That's an effort to avoid overhead in cases where compaction wouldn't help, since replaying an event log file is expensive.
   
   You're right if we imagine arbitrary workloads, but intuitively compaction would help only streaming queries in most cases (please correct me if I'm missing something here), since they should have very short execution time per job. (Streaming queries are in fact the target workload for the umbrella issue.)
   
   In the streaming world, latency has already come down to sub-second (some even claim sub-millisecond). Spark may show seconds of latency, but even a minute per batch is bad in micro-batch mode. Given that each event log file cannot be smaller than 10 MB (we can measure the file sizes and decide how many files to analyze if we don't want to rely on this fact), the workload's characteristics should already show up in the first file. And those characteristics will not change across event log files, which opens up another optimization: skipping the compaction attempt entirely if the log was previously analyzed as "no compaction".
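
   To make the caching idea above concrete, here is a minimal, self-contained Scala sketch. The names (`EventLogInfo`, `filteredEventRate`, `isCompactible`, the 0.7 threshold) are hypothetical illustrations rather than the PR's actual API, and `filteredEventRate` is a stand-in for replaying the first complete file:

```scala
import scala.collection.mutable

// Hypothetical summary of one application's rolled event log directory.
case class EventLogInfo(appId: String, completeFileCount: Int)

object CompactionEligibility {
  // Previous decisions keyed by application id, so a log once judged
  // "not worth compacting" is never replayed and analyzed again.
  private val cachedDecision = mutable.Map.empty[String, Boolean]

  // Stand-in for replaying the first complete file and computing the ratio
  // of events a compaction filter could drop (hypothetical).
  private def filteredEventRate(info: EventLogInfo): Double = 0.8

  /** Decide once per application whether compaction is worth attempting. */
  def isCompactible(info: EventLogInfo, minRate: Double = 0.7): Boolean =
    cachedDecision.getOrElseUpdate(info.appId, {
      if (info.completeFileCount < 1) {
        // No complete file yet, so there is nothing to analyze.
        false
      } else {
        // Heuristic: the first complete file of a streaming app is assumed
        // representative of the later files, so only that one is analyzed.
        filteredEventRate(info) >= minRate
      }
    })
}

object Demo extends App {
  val info = EventLogInfo("app-001", completeFileCount = 2)
  println(CompactionEligibility.isCompactible(info)) // analyzes the first file once
  println(CompactionEligibility.isCompactible(info)) // answered from the cache
}
```

   In the PR the decision would be persisted in the KVStore-backed `LogInfo` (see `listing.write(info.copy(compactible = compactible))` above) rather than an in-memory map, but the effect of skipping repeated replays is the same.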
   
   That's based on a heuristic; there's also the option of measuring statistics, as you suggested, if we want to make the logic smarter, but by leveraging the heuristic we can keep the logic simple (rather than making it smarter) and also avoid additional overhead. What do you think?
