HeartSaVioR commented on a change in pull request #27208: [SPARK-30481][CORE]
Integrate event log compactor into Spark History Server
URL: https://github.com/apache/spark/pull/27208#discussion_r367191271
##########
File path:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
##########
@@ -569,6 +570,35 @@ private[history] class FsHistoryProvider(conf: SparkConf,
clock: Clock)
}
}
+ /**
+ * Returns a tuple containing two values. Each element means:
+ * - 1st (Boolean): true if the list of event log files are changed, false
otherwise.
+ * - 2nd (Option[Long]): Some(value) if the method succeeds to try
compaction,
+ * value indicates the last event log index to try compaction. None
otherwise.
Review comment:
I tried to differentiate "the compacted log index" from "the log index Spark
tried to compact", but the wording wasn't sufficient or appropriate. (I admit
the naming is bad, and maybe the explanation too, but I can't find anything
better.) It refers to the latter.
The reason we store the log index in LogInfo is to avoid calling `compact`
whenever possible, since it's a heavy operation. How?
Given how compaction works (in particular, it excludes the log file with the
last index, since that file may still be changing), the result of compaction
is idempotent if we provide the same list of event log files.
In other words, once we have tried compaction for a certain set of event log
files, we don't need to try it again. For example, assume the list of event
log files contains 2.compact, 3, and 4. Once we have tried compaction with
that list, regardless of the result (success, low score, not enough files),
we don't need to try again unless we see 5 in the list of event log files.
In fact this is a bit simplified and there are some exceptional cases, such
as an exception being thrown while compacting, or configuration changing
across a restart of SHS. The former case is simple: we fail to store the
index in LogInfo anyway, so compaction is tried again at the next log check.
The latter case actually prevents us from relying on this fact, but I'd
ignore it as a trade-off to gain performance. If we address caching of state
in the compactor or filter, then it may not be a big deal to just call
`compact`, but until then I guess we need this.
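To make the idea concrete, here is a minimal sketch of the check described in this comment. It is not the code in this PR: `CompactionGuard`, `State`, `lastTriedIndex`, `shouldTry` and `maybeCompact` are hypothetical names, and it assumes the stored value is simply the highest log index present in the listing when compaction was last attempted.

```scala
object CompactionGuard {
  /** Hypothetical stand-in for the index that would be stored in LogInfo. */
  final case class State(lastTriedIndex: Option[Long])

  /** True only if the listing contains an index newer than the last one tried. */
  def shouldTry(logIndices: Seq[Long], state: State): Boolean =
    logIndices.nonEmpty && state.lastTriedIndex.forall(_ < logIndices.max)

  /**
   * Calls `compact` only when the listing has changed since the last attempt.
   * The outcome of `compact` (success, low score, not enough files) does not
   * matter: the attempt itself is recorded. If `compact` throws, no new state
   * is returned, so the caller keeps the old state and retries next time.
   */
  def maybeCompact(logIndices: Seq[Long], state: State)(
      compact: Seq[Long] => Unit): State =
    if (shouldTry(logIndices, state)) {
      compact(logIndices)
      State(Some(logIndices.max))
    } else {
      state
    }
}
```

With the listing 2.compact, 3, 4, the first call attempts compaction and records 4; a second call with the same listing is skipped; compaction is attempted again only once 5 shows up. A configuration change across an SHS restart is not detected by this sketch, which matches the trade-off described in this comment.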