GitHub user jianjianjiao commented on a diff in the pull request:
https://github.com/apache/spark/pull/22444#discussion_r218292773
--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -465,20 +475,31 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
           }
         } catch {
           case _: NoSuchElementException =>
-            // If the file is currently not being tracked by the SHS, add an entry for it and try
-            // to parse it. This will allow the cleaner code to detect the file as stale later on
-            // if it was not possible to parse it.
-            listing.write(LogInfo(entry.getPath().toString(), newLastScanTime, None, None,
-              entry.getLen()))
--- End diff --
Hi @squito, thanks for looking into this PR.
When the Spark history server starts, it scans the event-log folder and processes the logs on multiple threads, and it will not start the next scan until the first one finishes. That is the problem: in our cluster there are about 20K event-log files (often larger than 1 GB), including roughly 1K .inprogress files, and the first scan takes about two and a half hours. During those 2.5 hours, if a user submits a Spark application and it finishes, the user cannot find it in the Spark history UI and has to wait for the next scan.
That is why I added a limit on how many files to scan each time, e.g. 3K. No matter how many log files are in the event-log folder, the provider first scans and handles the first 3K of them, and then starts the second scan. Suppose that during the first scan 5 new applications appear and another 10 applications are updated; the second scan then handles those 15 applications plus another 2885 files (files 3001 to 5885) from the event folder.
checkForLogs scans the event-log folder and only handles files that have been updated or not yet handled.
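To make the batching concrete, here is a minimal, self-contained sketch of the per-scan cap. The names (LogFile, selectBatch, maxFilesPerScan) and the newest-first ordering are illustrative assumptions, not the actual identifiers or policy in this PR:

    // Illustrative only: cap how many event-log files one scan pass handles.
    case class LogFile(path: String, mtime: Long, needsProcessing: Boolean)

    def selectBatch(all: Seq[LogFile], maxFilesPerScan: Int = 3000): Seq[LogFile] =
      all.filter(_.needsProcessing)  // only files that are new or were updated
        .sortBy(-_.mtime)            // one plausible ordering: newest first
        .take(maxFilesPerScan)       // defer the rest to the next scan pass

With 5885 candidates and a cap of 3000, the first pass handles 3000 files and the next pass picks up the remainder plus anything that changed in between.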