HeartSaVioR commented on a change in pull request #27208: [SPARK-30481][CORE] 
Integrate event log compactor into Spark History Server
URL: https://github.com/apache/spark/pull/27208#discussion_r367191271
 
 

 ##########
 File path: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -569,6 +570,35 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  /**
+   * Returns a tuple containing two values. Each element means:
+   * - 1st (Boolean): true if the list of event log files are changed, false otherwise.
+   * - 2nd (Option[Long]): Some(value) if the method succeeds to try compaction,
+   *   value indicates the last event log index to try compaction. None otherwise.
 
 Review comment:
   I tried to differentiate "the compacted log index" from "the log index Spark 
tries to compact", but the wording does not seem sufficient or appropriate. (I 
admit the naming, and maybe the explanation too, is poor, but I cannot find 
anything better.) It refers to the latter.
   
   The reason we store the log index in LogInfo is to avoid calling `compact` 
whenever possible, since it is a heavy operation. How?
   
   Given how compaction works (in particular, it excludes the log file with the 
last index, since that file may still be changing), the result of compaction is 
idempotent as long as we provide the same list of event log files.
   
   In other words, once we have tried compaction for a certain set of event log 
files, we don't need to try again. For example, assume the list of event log 
files contains 2.compact, 3, and 4. Once we have tried compaction with that 
list, regardless of the result (success, low score, not enough files), we don't 
need to try again until we see 5 in the list of event log files.
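
   A minimal sketch of that check, using hypothetical names (`CompactionCheck`, `shouldTryCompaction`, `lastTriedIndex`) rather than the PR's actual code, and assuming the stored value is the highest log index in the listing at the time of the last attempt:

```scala
object CompactionCheck {
  // Hypothetical helper, not the PR's API.
  // logIndices: indices of the event log files currently listed,
  //   e.g. Seq(2, 3, 4) for the files 2.compact, 3, 4.
  // lastTriedIndex: the index recorded in LogInfo after the last compaction
  //   attempt, or None if compaction has never been attempted.
  def shouldTryCompaction(logIndices: Seq[Long], lastTriedIndex: Option[Long]): Boolean = {
    // Try only when there are files and a newer index has appeared
    // since the last attempt (or there was no attempt at all).
    logIndices.nonEmpty && lastTriedIndex.forall(_ < logIndices.max)
  }
}
```

   With the listing 2.compact, 3, 4 and a recorded index of 4, this returns false; it returns true again only once 5 shows up in the listing.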
   
   This is a bit simplified, and there are some exceptional cases, such as an 
exception being thrown during compaction, or the configuration changing across 
a restart of SHS. The former case is simple: we fail to store the index in 
LogInfo anyway, so compaction is tried again at the next check of the logs. The 
latter case actually prevents us from relying on this property, but I'd ignore 
it as a trade-off to gain performance. If we address caching of state in the 
compactor or filter, then it may not be a big deal to just call compact, but 
until then I guess we need this.
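
   The exception case can be sketched as follows. `LogState` and `tryCompaction` are hypothetical stand-ins for LogInfo and the provider method, not the PR's code; the only point illustrated is that the index is recorded after `compact` returns normally, so a failed attempt is retried at the next check:

```scala
import scala.util.control.NonFatal

object CompactionAttempt {
  // Hypothetical stand-in for the per-application state kept in LogInfo.
  final case class LogState(lastTriedIndex: Option[Long])

  // Runs `compact` and records `lastIndex` only if it returns normally.
  // On a non-fatal exception the state is returned unchanged, so the
  // next check of the logs sees no recorded index and tries again.
  def tryCompaction(state: LogState, lastIndex: Long)(compact: () => Unit): LogState = {
    try {
      compact()
      state.copy(lastTriedIndex = Some(lastIndex))
    } catch {
      case NonFatal(_) => state
    }
  }
}
```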
