HeartSaVioR commented on a change in pull request #27208: [SPARK-30481][CORE] Integrate event log compactor into Spark History Server
URL: https://github.com/apache/spark/pull/27208#discussion_r369827462
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -795,15 +800,42 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
         // mean the end event is before the configured threshold, so call the method again to
         // re-parse the whole log.
         logInfo(s"Reparsing $logPath since end event was not found.")
-        doMergeApplicationListing(reader, scanTime, enableOptimizations = false)
+        doMergeApplicationListing(reader, scanTime, enableOptimizations = false,
+          lastEvaluatedForCompaction)
 
       case _ =>
         // If the app hasn't written down its app ID to the logs, still record the entry in the
         // listing db, with an empty ID. This will make the log eligible for deletion if the app
         // does not make progress after the configured max log age.
         listing.write(
           LogInfo(logPath.toString(), scanTime, LogType.EventLogs, None, None,
-            reader.fileSizeForLastIndex, reader.lastIndex, reader.completed))
+            reader.fileSizeForLastIndex, reader.lastIndex, lastEvaluatedForCompaction,
+            reader.completed))
+    }
+  }
+
+  private def compact(reader: EventLogFileReader): Unit = {
+    val rootPath = reader.rootPath
+    try {
+      reader.lastIndex match {
+        case Some(lastIndex) =>
+          try {
+            val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+            if (info.lastEvaluatedForCompaction.isEmpty ||
+              info.lastEvaluatedForCompaction.get < lastIndex) {
+              // haven't tried compaction for this index, do compaction
+              fileCompactor.compact(reader.listEventLogFiles)
 
 Review comment:
   > So one thing that feels a tiny bit odd is that when deciding whether to compact, you're actually considering the last log file, which you won't consider during actual compaction, right?
   > Wouldn't that cause unnecessary (or too aggressive) compaction at the end of the application, when potentially a bunch of jobs finish and "release" lots of tasks, inflating the compaction score?
   
   That's intentional: callers of the compactor don't need to care how many files are actually affected. They only need to know that compacting the same list of log files will produce the same result, unless compaction fails and throws an exception. How many files are excluded from compaction is just configuration, and the fact that the last log file must be excluded is an implementation detail. (We enforce this in both the configuration and the compactor by making 1 the minimum value for the max retained log files.)
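   To illustrate what I mean by the contract (a rough, self-contained sketch, not the PR's actual `EventLogFileCompactor`; the names and retain count below are placeholders): the compactor itself decides how many trailing files to keep, clamped to a minimum of 1, so the in-progress last file is never compacted and the same input list always yields the same candidates:

   ```scala
   object CompactorContractSketch {
     // Stand-in for the "max retained log files" setting; the real Spark config
     // key and default may differ. Clamping to 1 guarantees the last (current)
     // event log file is never considered for compaction.
     def filesToCompact(allLogFiles: Seq[String], configuredRetain: Int): Seq[String] = {
       val retain = math.max(1, configuredRetain)
       allLogFiles.dropRight(retain)
     }

     def main(args: Array[String]): Unit = {
       val logs = (1 to 5).map(i => s"events_$i")
       // Same input list -> same candidates, regardless of how many files end up compacted.
       assert(filesToCompact(logs, 2) == filesToCompact(logs, 2))
       println(filesToCompact(logs, 2)) // Vector(events_1, events_2, events_3)
       println(filesToCompact(logs, 0)) // retain clamped to 1 -> events_1 .. events_4
     }
   }
   ```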
   
   The compactor ignores the last log file in any case, as configured, so except for the rare case where the log happens to be rolled just before the app finishes, that scenario won't occur. And most likely end users will avoid setting the value to 1 if they read the docs and understand how it works.
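   To make the "evaluate once per index" part concrete (simplified sketch; `ListingEntry` and `maybeCompact` are stand-ins for the listing's `LogInfo` bookkeeping, not the actual classes), compaction is only attempted when the log hasn't yet been evaluated for the current last index:

   ```scala
   object CompactionGuardSketch {
     // Stand-in for the listing entry; only the field relevant to the guard is kept.
     final case class ListingEntry(lastEvaluatedForCompaction: Option[Long])

     // Returns true if compaction was attempted for `lastIndex`, false if it was skipped.
     def maybeCompact(entry: ListingEntry, lastIndex: Long)(doCompact: () => Unit): Boolean = {
       // Equivalent to: isEmpty || get < lastIndex
       val notYetEvaluated = entry.lastEvaluatedForCompaction.forall(_ < lastIndex)
       if (notYetEvaluated) {
         doCompact() // haven't tried compaction for this index yet
         true
       } else {
         false // already evaluated up to (or beyond) lastIndex; skip
       }
     }

     def main(args: Array[String]): Unit = {
       println(maybeCompact(ListingEntry(None), lastIndex = 3L)(() => println("compacting")))     // true
       println(maybeCompact(ListingEntry(Some(3L)), lastIndex = 3L)(() => println("compacting"))) // false
     }
   }
   ```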
