vanzin commented on a change in pull request #26416: [SPARK-29779][CORE]
Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r357828921
##########
File path:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
##########
@@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(
Review comment:
> we intuitively know that compaction would help only streaming query in most cases
If you're going to restrict yourself to that assumption, then why bother
with such a complicated approach?
You can avoid all this and just toss all the old log files. At most, you
can create a "compact file" in one pass by keeping the app start, env update,
and executor start / end events; everything else is uninteresting if all you
care about is keeping the latest jobs from a streaming query.
That is much, much simpler and more efficient than what you have.
But if you're going the complicated route, I think it makes sense to think a
little past that one use case. For example, a JDBC server can benefit from
this. So can a long-running "shell" session (whether spark-shell or something
like Zeppelin). So can servers like Livy.
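
For illustration only, the one-pass "compact file" idea described above could
be sketched roughly as follows. The event types here are simplified,
hypothetical stand-ins (they are not Spark's actual listener event classes),
and `OnePassCompaction`, `keep`, and `compact` are names made up for this
sketch; nothing here is taken from the PR.

```scala
// Hypothetical, simplified stand-ins for event log entries.
// These are NOT the real Spark listener event classes.
sealed trait Event
case object ApplicationStart extends Event
case object EnvironmentUpdate extends Event
final case class ExecutorAdded(executorId: String) extends Event
final case class ExecutorRemoved(executorId: String) extends Event
final case class JobStart(jobId: Int) extends Event
final case class JobEnd(jobId: Int) extends Event

object OnePassCompaction {
  // Keep only the events named in the comment: app start, env update,
  // and executor start / end. Everything else is dropped.
  def keep(event: Event): Boolean = event match {
    case ApplicationStart | EnvironmentUpdate => true
    case _: ExecutorAdded | _: ExecutorRemoved => true
    case _ => false
  }

  // A single streaming pass over the old events; no per-job state is tracked.
  def compact(oldEvents: Iterator[Event]): Iterator[Event] =
    oldEvents.filter(keep)
}
```

The point of the sketch is that the filter is stateless: each event is judged
on its own, so the old files can be read once and written out once. That is
the trade-off against the approach in the PR, which tracks which jobs are
still live in order to retain their events.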