GitHub user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21895#discussion_r206780805
  
    --- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -973,6 +978,42 @@ private[history] object FsHistoryProvider {
       private[history] val CURRENT_LISTING_VERSION = 1L
     }
     
    +private[history] trait CachedFileSystemHelper extends Logging {
    +  protected def fs: FileSystem
    +  protected def expireTimeInSeconds: Long
    +
    +  /**
    +   * LRU cache containing the result for the already checked files.
    +   */
    +  // Visible for testing.
    +  private[history] val cache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(expireTimeInSeconds, TimeUnit.SECONDS)
    +    .build[String, java.lang.Boolean]()
    --- End diff --
    
Memory doesn't increase indefinitely, as entries expire over time. 
Moreover, since each entry stores only a string containing the name of the file 
and a Boolean, it needs roughly 100 bytes in memory. With 100k 
event logs, this means about 10MB, which doesn't seem to me a value which can 
cause an OOM. Anyway, we can also add a maximum number of entries for this 
cache if you think it is necessary. This would cause some more RPC calls though.
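As a quick sanity check of the estimate above (the ~100 bytes per entry is a rough assumption covering the file-name string, the boxed Boolean, and cache overhead, not a measured figure):

```python
# Back-of-the-envelope footprint of the cache discussed above.
# BYTES_PER_ENTRY is an assumed ballpark, not a measured value.
BYTES_PER_ENTRY = 100
NUM_EVENT_LOGS = 100_000

total_bytes = BYTES_PER_ENTRY * NUM_EVENT_LOGS
print(f"{total_bytes / 1e6:.0f} MB")  # prints "10 MB"
```

So even an order-of-magnitude error in the per-entry estimate would keep the cache around 100MB, well below typical history server heap sizes.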


---
