Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/21895#discussion_r206780805
--- Diff:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -973,6 +978,42 @@ private[history] object FsHistoryProvider {
private[history] val CURRENT_LISTING_VERSION = 1L
}
+private[history] trait CachedFileSystemHelper extends Logging {
+ protected def fs: FileSystem
+ protected def expireTimeInSeconds: Long
+
+ /**
+ * LRU cache containing the result for the already checked files.
+ */
+ // Visible for testing.
+ private[history] val cache = CacheBuilder.newBuilder()
+ .expireAfterAccess(expireTimeInSeconds, TimeUnit.SECONDS)
+ .build[String, java.lang.Boolean]()
--- End diff ---
Memory doesn't increase indefinitely, since entries expire over time.
Moreover, because each entry stores only a string containing the file name
and a Boolean, it needs roughly 100 bytes of memory. With 100k event logs,
that is about 10 MB, which doesn't seem large enough to cause an OOM. That
said, we can also add a maximum number of entries for this cache if you
think it is necessary; this would cause some additional RPC calls, though.
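To illustrate the "maximum number of entries" option mentioned above: with Guava's `CacheBuilder` (which the diff uses), bounding the cache is a matter of adding `.maximumSize(n)` alongside `.expireAfterAccess(...)`. The sketch below shows the same size-bounded LRU idea using only the JDK's `LinkedHashMap`, since it is self-contained; the class name and capacity are hypothetical and this is not the PR's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a size-bounded LRU cache, illustrating the
// "maximum number of entries" option discussed above. The PR itself uses
// Guava's CacheBuilder, where the equivalent knob is .maximumSize(n).
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        // accessOrder = true makes iteration order least-recently-accessed
        // first, which gives LRU (rather than insertion-order) eviction.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently accessed entry once the bound is exceeded.
        return size() > maxEntries;
    }
}
```

With a bound like this, memory stays capped even under pathological numbers of event logs, at the cost of re-issuing an RPC whenever an evicted file is checked again.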
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]