linehrr edited a comment on issue #24461: [SPARK-27434][CORE] Fix mem leak 
URL: https://github.com/apache/spark/pull/24461#issuecomment-487990769
 
 
   Looked more into the Hadoop side: `FileSystem.get(URI uri, Configuration conf)` has caching built in, unless it is specifically disabled:
   `return conf.getBoolean(disableCacheName, false) ? createFileSystem(uri, conf) : CACHE.get(uri, conf);`
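
To make that branch concrete, here is a minimal stand-alone sketch of the same dispatch. It is a model, not Hadoop's code: `FsGetDispatchSketch`, the plain `Map` used as configuration, and the `Object` used as a file system are all hypothetical stand-ins. The property name pattern `fs.<scheme>.impl.disable.cache` is what `disableCacheName` expands to in Hadoop, as far as I know.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Stand-in model of the cached/uncached branch in FileSystem.get; not Hadoop's code. */
final class FsGetDispatchSketch {
    // Hypothetical cache keyed by "scheme://authority" (the real key also holds the user).
    static final Map<String, Object> CACHE = new HashMap<>();

    /** Per-scheme switch name, e.g. fs.hdfs.impl.disable.cache. */
    static String disableCacheName(String scheme) {
        return String.format("fs.%s.impl.disable.cache", scheme);
    }

    /** Returns a fresh object when caching is disabled, otherwise the cached one. */
    static Object get(URI uri, Map<String, String> conf) {
        boolean disabled = Boolean.parseBoolean(
                conf.getOrDefault(disableCacheName(uri.getScheme()), "false"));
        String cacheKey = uri.getScheme() + "://" + uri.getAuthority();
        return disabled ? new Object() : CACHE.computeIfAbsent(cacheKey, k -> new Object());
    }

    public static void main(String[] args) {
        URI uri = URI.create("hdfs://namenode:8020/spark-logs");
        Map<String, String> conf = new HashMap<>();
        System.out.println("cache on, same instance:  " + (get(uri, conf) == get(uri, conf)));
        conf.put(disableCacheName("hdfs"), "true");
        System.out.println("cache off, same instance: " + (get(uri, conf) == get(uri, conf)));
    }
}
```

With the default (cache enabled) both calls hand back the same object; with the switch set, every call builds a new one.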
   
   Because of this, the FileSystem object is cached and put into a map under a generated Key:
   
               Key(URI uri, Configuration conf) throws IOException {
                   this(uri, conf, 0L);
               }

               Key(URI uri, Configuration conf, long unique) throws IOException {
                   this.scheme = uri.getScheme() == null ? "" : StringUtils.toLowerCase(uri.getScheme());
                   this.authority = uri.getAuthority() == null ? "" : StringUtils.toLowerCase(uri.getAuthority());
                   this.unique = unique;
                   this.ugi = UserGroupInformation.getCurrentUser();
               }
   
   Therefore, in theory, if the baseLogDir is the same (same scheme and authority), the current user is the same, and the Hadoop conf doesn't change, the object will be reused between Spark contexts. However, for some unknown reason it was not, and a new one got created every time.
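
As a rough illustration of what "same key" means here, the sketch below models a key built from the fields the quoted constructor sets. It assumes equality is decided by exactly those fields; `FsCacheKeySketch` is a hypothetical name, a plain `String` stands in for `UserGroupInformation`, and an `Object` stands in for the file system.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

/** Simplified stand-in for the FileSystem cache lookup; not Hadoop's actual code. */
final class FsCacheKeySketch {

    /** Holds the fields the quoted Key constructor sets: scheme, authority, unique, user. */
    static final class Key {
        final String scheme;
        final String authority;
        final long unique;
        final String user; // stand-in for UserGroupInformation.getCurrentUser()

        Key(URI uri, String user) {
            this.scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase(Locale.ROOT);
            this.authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase(Locale.ROOT);
            this.unique = 0L;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority)
                    && unique == k.unique && user.equals(k.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, unique, user);
        }
    }

    static final Map<Key, Object> CACHE = new HashMap<>();

    /** Returns the cached object for this key, creating it on first use. */
    static Object get(URI uri, String user) {
        return CACHE.computeIfAbsent(new Key(uri, user), k -> new Object());
    }

    public static void main(String[] args) {
        Object a = get(URI.create("hdfs://NameNode:8020/spark-logs/app-1"), "alice");
        Object b = get(URI.create("hdfs://namenode:8020/spark-logs/app-2"), "alice");
        Object c = get(URI.create("hdfs://namenode:8020/spark-logs/app-1"), "bob");
        System.out.println("same scheme/authority/user reused: " + (a == b)); // true
        System.out.println("different user reused: " + (a == c));             // false
    }
}
```

Note that the path is not part of the key in this model, and that a different user produces a different key even for the same URI, which is one place a cache miss could come from.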
   
   On the other hand, the `close()` method is safer than you thought. First, it only closes the one cached FileSystem it is called on, not all of them, so it's likely you are only closing the one you created. Also, closing a FileSystem does not actually CLOSE that file system, according to the `close()` method of `FileSystem`:
   ```
           public void close() throws IOException {
               this.processDeleteOnExit();
               CACHE.remove(this.key, this);
           }
   ```
   
   It processes the pending delete-on-exit paths and removes that object from the cache, that's it.
   So if by any chance other threads hold this object, it will be fine and those threads can continue using it.
   The most important gain from closing this FileSystem object is getting it de-referenced from the cache, so GC can eventually reclaim it from the heap once no one else holds a reference to it.
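
A small model of that remove-from-cache behaviour, under the assumption that `close()` does only what the quoted snippet shows. `FsCloseSketch` and `Handle` are hypothetical names; concrete FileSystem subclasses may override `close()` and release more than this, which the sketch does not cover.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in model of "close() only evicts this instance from the cache". */
final class FsCloseSketch {
    static final Map<String, Handle> CACHE = new HashMap<>();

    /** Hypothetical stand-in for a cached FileSystem instance. */
    static final class Handle {
        final String key;

        Handle(String key) {
            this.key = key;
        }

        /** Drops this exact instance from the cache; other entries are untouched. */
        void close() {
            // Two-argument remove is a no-op if the key now maps to a different instance.
            CACHE.remove(key, this);
        }

        String describe() {
            return "handle for " + key;
        }
    }

    /** Returns the cached handle for the key, creating it on first use. */
    static Handle get(String key) {
        return CACHE.computeIfAbsent(key, Handle::new);
    }

    public static void main(String[] args) {
        Handle first = get("hdfs://namenode:8020");
        get("file://");
        first.close();
        System.out.println("closed entry still cached: " + CACHE.containsKey("hdfs://namenode:8020")); // false
        System.out.println("other entry still cached:  " + CACHE.containsKey("file://"));              // true
        System.out.println("old reference still works: " + first.describe());
        Handle second = get("hdfs://namenode:8020");
        System.out.println("next get reuses old one:   " + (first == second));                         // false
    }
}
```

Once `first` goes out of scope nothing in the cache points at it any more, so it becomes eligible for collection.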
