oldbelvey opened a new issue, #13288:
URL: https://github.com/apache/dolphinscheduler/issues/13288

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   The worker ran out of memory (OOM). The heap dump contains 91879
   `org.apache.hadoop.conf.Configuration` objects, which together account for
   94% of the 8 GB heap (retained heap).
   Below is the path to the GC root for one `Configuration` object.
   ```
   Class Name                                                                                                               | Shallow Heap | Retained Heap
   -------------------------------------------------------------------------------------------------------------------------------------------------------
   org.apache.hadoop.conf.Configuration @ 0x5f880d4f8                                                                       |           40 |        85,072
   |- conf org.apache.hadoop.hdfs.server.namenode.ha.ZkConfiguredFailoverProxyProvider @ 0x5f880d450                        |           96 |           296
   |  |- proxyProvider org.apache.hadoop.io.retry.RetryInvocationHandler @ 0x5f8837850                                      |           40 |         1,592
   |  |  |- h com.sun.proxy.$Proxy138 @ 0x5f883f410                                                                         |           16 |         1,608
   |  |  |  |- federatedNamenode org.apache.hadoop.hdfs.DFSClient @ 0x7f877c610                                             |          128 |         4,160
   |  |  |  |  |- dfs org.apache.hadoop.hdfs.DistributedFileSystem @ 0x7f877c2f0                                            |           56 |         5,336
   |  |  |  |  |  |- fs org.apache.hadoop.fs.viewfs.ChRootedFileSystem @ 0x7f873e400                                        |           56 |         6,256
   |  |  |  |  |  |  |- fs org.apache.hadoop.fs.viewfs.MergedInodeTree$INodeMerge @ 0x7f873d7d0                             |           32 |         6,984
   |  |  |  |  |  |  |  |- target org.apache.hadoop.fs.viewfs.InodeTree$MountPoint @ 0x5f8841af0                            |           24 |           136
   |  |  |  |  |  |  |  |  |- [171] java.lang.Object[244] @ 0x7f7ec7330                                                     |          992 |        27,480
   |  |  |  |  |  |  |  |  |  |- elementData java.util.ArrayList @ 0x7e9599498                                              |           24 |        27,504
   |  |  |  |  |  |  |  |  |  |  |- mountPoints org.apache.hadoop.fs.viewfs.ViewFileSystem$1 @ 0x7e9599478                  |           32 |     1,616,104
   |  |  |  |  |  |  |  |  |  |  |  |- fsState org.apache.hadoop.fs.viewfs.ViewFileSystem @ 0x6bb0d4318                     |           72 |     1,617,440
   |  |  |  |  |  |  |  |  |  |  |  |  |- viewFs org.apache.hadoop.hdfs.FederatedDFSFileSystem @ 0x6bb405ad8                |           64 |     1,618,016
   |  |  |  |  |  |  |  |  |  |  |  |  |  |- this$0 org.apache.hadoop.hdfs.FederatedDFSFileSystem$1 @ 0x6bb55eab8           |           16 |            16
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- renewMpt org.apache.hadoop.hdfs.MountPointRenewer @ 0x6bb55a0a0             |           64 |        57,720
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- this$0 org.apache.hadoop.hdfs.MountPointRenewer$3 @ 0x5fb360928          |           40 |        57,776
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- <Java Local> java.util.TimerThread @ 0x5fb360748  Timer-35086 Thread  |          128 |           304
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- [1] java.util.TimerTask[128] @ 0x5fb360538                            |          528 |           528
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |- queue java.util.TaskQueue @ 0x5fb360520 Busy Monitor               |           24 |           552
   |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  '- Total: 1 entry                                                     |              |
   ```
   
   
   
   
   
   
   
   ### What you expected to happen
   
   
   The Timer in `org.apache.hadoop.hdfs.MountPointRenewer` keeps these objects
   reachable, so the GC can never reclaim them.
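
The retention mechanism can be illustrated with a minimal, self-contained sketch. `TimerLeakDemo` and `RenewingResource` are hypothetical names, not Hadoop classes; the sketch only assumes that `MountPointRenewer` schedules a task on a `java.util.Timer`, as the GC-root path above suggests.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a resource that, like MountPointRenewer, owns a Timer. */
public class TimerLeakDemo {
    /** Number of resources whose timer has not been cancelled yet. */
    static final AtomicInteger LIVE = new AtomicInteger();

    static final class RenewingResource implements AutoCloseable {
        // Simulates the state (e.g. a Configuration) that the timer task pins in memory.
        private final byte[] state = new byte[1024];
        private final Timer timer = new Timer("renewer", true);

        RenewingResource() {
            LIVE.incrementAndGet();
            // The anonymous task captures `this`, so the timer thread keeps this
            // object (and `state`) reachable until cancel() is called.
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    touch();
                }
            }, 60_000L, 60_000L);
        }

        void touch() {
            // Periodic renewal work would go here.
        }

        @Override
        public void close() {
            timer.cancel(); // terminates the timer and discards its scheduled tasks
            LIVE.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        RenewingResource leaked = new RenewingResource(); // dropped without close()
        leaked = null;
        try (RenewingResource closed = new RenewingResource()) {
            // Used and then closed by try-with-resources.
        }
        System.out.println("resources with a live timer: " + LIVE.get());
    }
}
```

Dropping the last application reference is not enough here: the object stays reachable through the timer thread until `close()` cancels the timer.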
   
   The `MountPointRenewer` most likely comes from the `FileSystem` class, and
   the dump contains 192 `FileSystem` instances.
   
   I searched the source code for `FileSystem`: it is only used in
   `org.apache.dolphinscheduler.common.utils.HadoopUtils`, where I found the
   code below.
   
   ```
       private static final LoadingCache<String, HadoopUtils> cache = CacheBuilder
               .newBuilder()
               .expireAfterWrite(PropertyUtils.getInt(Constants.KERBEROS_EXPIRE_TIME, 2), TimeUnit.HOURS)
               .build(new CacheLoader<String, HadoopUtils>() {
                   @Override
                   public HadoopUtils load(String key) throws Exception {
                       return new HadoopUtils();
                   }
               });
   ```
   
   By default a new `HadoopUtils` is created every 2 hours, and the filesystem
   held by the expired instance is never closed.
   
   ### How to reproduce
   
   1. DolphinScheduler version 1.3.6 (later versions behave the same)
   2. DolphinScheduler `common.properties`: `resource.storage.type=HDFS`
   3. Hadoop `core-site.xml`: `fs.AbstractFileSystem.hdfs.impl = org.apache.hadoop.fs.FederatedHdfs`
   
   With this configuration the worker runs out of memory every few days to
   every few months, depending on how much memory is configured.
   
   ### Anything else
   
   In my opinion there is no need to recreate `HadoopUtils` every 2 hours
   through the Guava cache (`LoadingCache`).
   
   I am changing it to a single instance and using
   `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();` to
   refresh the Kerberos ticket.
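
A rough sketch of that idea, kept free of Hadoop dependencies so it runs on its own: `SingleInstanceHolder` is a hypothetical helper, not code from DolphinScheduler. It assumes the real factory would be `HadoopUtils::new` and the real refresh hook would wrap the `checkTGTAndReloginFromKeytab()` call mentioned above.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder: creates the wrapped object once and runs a refresh
 * hook on every access instead of rebuilding the object periodically.
 */
public final class SingleInstanceHolder<T> {
    private final Supplier<T> factory;
    private final Runnable refresh;
    private T instance;
    private int created;

    public SingleInstanceHolder(Supplier<T> factory, Runnable refresh) {
        this.factory = factory;
        this.refresh = refresh;
    }

    /** Returns the single instance, creating it on first use. */
    public synchronized T get() {
        if (instance == null) {
            instance = factory.get();
            created++;
        }
        // Assumption: in the real code this hook would refresh the Kerberos ticket.
        refresh.run();
        return instance;
    }

    /** Number of times the factory has been invoked (at most 1). */
    public synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        int[] refreshes = {0};
        SingleInstanceHolder<Object> holder =
                new SingleInstanceHolder<>(Object::new, () -> refreshes[0]++);
        Object first = holder.get();
        Object second = holder.get();
        System.out.println("same instance: " + (first == second));
        System.out.println("created: " + holder.createdCount());
        System.out.println("refreshes: " + refreshes[0]);
    }
}
```

Because only one instance is ever created, no additional `FileSystem` objects (and their timers) accumulate over time; whether refreshing on every access is acceptable depends on how cheap the relogin check is.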
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

