HeartSaVioR commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r358260674
##########
File path:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
##########
@@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
}
}
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to a single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is
+          // eligible for further compaction.
+          val rate = eventFilterRateCalculator.calculate(
Review comment:
I'll just consolidate the filter rate calculator and the compactor into one component, so that the event filters built from the first-phase read can be applied to both scoring and compacting.
I just took the simplest calculation for the "score": given that the calculation is heuristic (the whole point is to avoid reading event logs again when compaction doesn't seem to help, which is what I have been struggling with), improving it would require experience with lots of real-world event logs, which I honestly don't have, and it would take time if we require it in the scope of this PR. I guess the simple calculation should work for the streaming scenario (and probably also for a long-running "shell" session running interactive queries), so it seems OK to start from here and leave further improvement out of the scope of this PR.
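
To illustrate, here's a rough sketch of the kind of simple score I mean; `EventFilter` and `acceptsLine` below are placeholders for this comment, not the actual API in the PR:

```scala
// Hypothetical sketch: score an event log by the fraction of lines the
// filters would drop. A high rate means compaction is likely worthwhile.
trait EventFilter {
  def acceptsLine(line: String): Boolean
}

class SimpleFilterRateCalculator(filters: Seq[EventFilter]) {
  // Returns Some(fraction of lines filtered out), or None if there was
  // nothing to score.
  def calculate(lines: Iterator[String]): Option[Double] = {
    var total = 0L
    var accepted = 0L
    lines.foreach { line =>
      total += 1
      // A line survives compaction only if every filter accepts it.
      if (filters.forall(_.acceptsLine(line))) accepted += 1
    }
    if (total == 0L) None else Some(1.0 - accepted.toDouble / total)
  }
}
```

The idea would be to trigger compaction only when the rate crosses a configured threshold, so logs with little filterable content don't pay for a second read.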
And how many files/lines/bytes should we read before deciding the app doesn't need further analysis (so that we can skip even the first-phase read)? This seems to require another heuristic, which I'd also like to exclude from the scope of this PR. For now, I updated the logic to recalculate each time a new event log file index appears.
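
As a sketch of what "recalculate each time a new index appears" means here (the `lastScoredIndex` bookkeeping is illustrative, not the code in this PR):

```scala
// Illustrative guard: re-run the scoring pass only when the reader reports a
// new event log file index since the last time we scored this application.
private var lastScoredIndex: Option[Long] = None

private def shouldRecalculate(reader: EventLogFileReader): Boolean = {
  reader.lastIndex match {
    case Some(idx) if !lastScoredIndex.contains(idx) =>
      lastScoredIndex = Some(idx)
      true
    // Single event log file (no index) or no new file since last scoring.
    case _ => false
  }
}
```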
We may also be able to improve building the event filters via incremental reads per event log file, which might be another point of improvement. I'll see how to deal with this, but it would be ideal if we could exclude that from the scope of this PR as well.
Does the proposal work for you, or are there things you consider mandatory?