[
https://issues.apache.org/jira/browse/HIVE-26404?focusedWorklogId=809332&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809332
]
ASF GitHub Bot logged work on HIVE-26404:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 15/Sep/22 22:08
Start Date: 15/Sep/22 22:08
Worklog Time Spent: 10m
Work Description: sonarcloud[bot] commented on PR #3514:
URL: https://github.com/apache/hive/pull/3514#issuecomment-1248685147
Kudos, SonarCloud Quality Gate passed! [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=3514)
[1 Bug](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=BUG)
[0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=VULNERABILITY)
[0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3514&resolved=false&types=SECURITY_HOTSPOT)
[6 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=CODE_SMELL)
No Coverage information
No Duplication information
Issue Time Tracking
-------------------
Worklog Id: (was: 809332)
Time Spent: 20m (was: 10m)
> HMS memory leak when compaction cleaner fails to remove obsolete files
> ----------------------------------------------------------------------
>
> Key: HIVE-26404
> URL: https://issues.apache.org/jira/browse/HIVE-26404
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Affects Versions: 4.0.0-alpha-1
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> While investigating an issue where the HMS becomes unresponsive, we noticed
> many failed attempts by the compaction Cleaner thread to remove obsolete
> directories, with exceptions similar to the one below.
> {noformat}
> 2022-06-16 05:48:24,819 ERROR org.apache.hadoop.hive.ql.txn.compactor.Cleaner: [Cleaner-executor-thread-0]: Caught exception when cleaning, unable to complete cleaning of id:4410976,dbname:my_database,tableName:my_table,partName:day=20220502,state:,type:MAJOR,enqueueTime:0,start:0,properties:null,runAs:some_user,tooManyAborts:false,hasOldAbort:false,highestWriteId:187502,errorMessage:null
> java.io.IOException: Not enough history available for (187502,x). Oldest available base: hdfs://nameservice1/warehouse/tablespace/managed/hive/my_database.db/my_table/day=20220502/base_0188687_v4297872
>         at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1432)
>         at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.removeFiles(Cleaner.java:261)
>         at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.access$000(Cleaner.java:71)
>         at org.apache.hadoop.hive.ql.txn.compactor.Cleaner$1.run(Cleaner.java:203)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>         at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.clean(Cleaner.java:200)
>         at org.apache.hadoop.hive.ql.txn.compactor.Cleaner.lambda$run$0(Cleaner.java:105)
>         at org.apache.hadoop.hive.ql.txn.compactor.CompactorUtil$ThrowingRunnable.lambda$unchecked$0(CompactorUtil.java:54)
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> In addition, the logs contained a large number of long JVM pauses, as shown
> below, and the HMS resident memory (RSZ) kept increasing at a rate of 90MB per hour.
> {noformat}
> 2022-06-16 16:17:17,805 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 34346ms
> 2022-06-16 16:17:21,497 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1690ms
> 2022-06-16 16:17:57,696 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 34697ms
> 2022-06-16 16:18:01,326 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1628ms
> 2022-06-16 16:18:37,280 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 34453ms
> 2022-06-16 16:18:40,927 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1646ms
> 2022-06-16 16:19:16,929 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 33997ms
> 2022-06-16 16:19:20,572 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1637ms
> 2022-06-16 16:20:01,643 WARN org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 39329ms
> 2022-06-16 16:20:05,572 INFO org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: Detected pause in JVM or host machine (eg GC): pause of approximately 1927ms
> {noformat}
> We took a heap dump of the HMS around the time it became unresponsive and saw
> many Configuration objects (~40K) occupying more than 90% of the heap (~9GB).
> {noformat}
> Class Name                                   | Objects     | Shallow Heap  | Retained Heap
> ----------------------------------------------------------------------------------------------
> org.apache.hadoop.conf.Configuration         | 39,452      | 1,893,696     | >= 8,560,573,960
> java.util.concurrent.ConcurrentHashMap       | 155,863     | 9,975,232     | >= 4,696,003,968
> java.util.concurrent.ConcurrentHashMap$Node[]| 139,348     | 1,312,967,944 | >= 4,686,230,296
> java.util.Properties                         | 87,119      | 4,181,712     | >= 4,193,638,904
> java.util.Hashtable$Entry[]                  | 87,840      | 987,968,472   | >= 4,189,518,928
> java.util.concurrent.ConcurrentHashMap$Node  | 99,097,078  | 3,171,106,496 | >= 3,375,319,552
> java.util.Hashtable$Entry                    | 100,047,081 | 3,201,506,592 | >= 3,201,551,936
> org.postgresql.jdbc.PgConnection             | 6,488       | 830,464       | >= 551,442,952
> ----------------------------------------------------------------------------------------------
> {noformat}
> It turns out that these Configuration objects are all referenced by CACHE
> entries in org.apache.hadoop.fs.FileSystem$Cache.
> {noformat}
> Class Name                                                                   | Shallow Heap | Retained Heap
> ----------------------------------------------------------------------------------------------------------------------
> org.apache.hadoop.fs.FileSystem$Cache @ 0x45403fe70                          | 32           | 108,671,824
> |- <class> class org.apache.hadoop.fs.FileSystem$Cache @ 0x45410c3e0         | 8            | 544
> '- map java.util.HashMap @ 0x453ffb598                                       | 48           | 92,777,232
>    |- <class> class java.util.HashMap @ 0x4520382c8 System Class             | 40           | 168
>    |- entrySet java.util.HashMap$EntrySet @ 0x454077848                      | 16           | 16
>    '- table java.util.HashMap$Node[32768] @ 0x463585b68                      | 131,088      | 92,777,168
>       |- class java.util.HashMap$Node[] @ 0x4520b7790                        | 0            | 0
>       '- [1786] java.util.HashMap$Node @ 0x451998ce0                         | 32           | 9,968
>          |- <class> class java.util.HashMap$Node @ 0x4520b7728 System Class  | 8            | 32
>          '- value org.apache.hadoop.hdfs.DistributedFileSystem @ 0x452990178 | 56           | 4,976
>             |- <class> class org.apache.hadoop.hdfs.DistributedFileSystem @ 0x45402e290 | 8 | 4,664
>             |- uri java.net.URI @ 0x451a05cd0 hdfs://nameservice1            | 80           | 432
>             |- dfs org.apache.hadoop.hdfs.DFSClient @ 0x451f5d9b8            | 128          | 3,824
>             '- conf org.apache.hadoop.hive.conf.HiveConf @ 0x453a34b38       | 80           | 250,160
> ----------------------------------------------------------------------------------------------------------------------
> {noformat}
> As long as they are referenced by the CACHE they cannot be garbage collected,
> so this leads to a memory leak.
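The accumulation pattern described above can be illustrated with a simplified, self-contained model in plain Java. The class and method names (`Ugi`, `Conf`, `Fs`, `leakDemo`) are hypothetical stand-ins, not Hadoop's actual classes: the point is only that when each cleaner invocation creates a fresh identity-based cache key, every failed cleanup leaves one more entry (and its configuration object) permanently reachable.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch (hypothetical names) of a FileSystem$Cache-style leak:
// each cleaner run creates a fresh UGI-like key, so no get() ever hits an
// existing entry, and skipping the eviction step leaks one entry per run.
public class CacheLeakSketch {
    static class Ugi {}                              // stands in for UserGroupInformation
    static class Conf {}                             // stands in for a Configuration object
    static class Fs { final Conf conf; Fs(Conf c) { conf = c; } }

    // Identity-based keys mirror how a new UGI per doAs() never collides.
    static final Map<Ugi, Fs> CACHE = new HashMap<>();

    static Fs get(Ugi ugi, Conf conf) {
        return CACHE.computeIfAbsent(ugi, u -> new Fs(conf));
    }

    // Mirrors an explicit per-UGI eviction step (e.g. closeAllForUGI).
    static void closeAllForUgi(Ugi ugi) {
        CACHE.remove(ugi);
    }

    // Simulates N cleaner runs that all fail before eviction; returns cache size.
    static int leakDemo(int cleanerRuns) {
        for (int i = 0; i < cleanerRuns; i++) {
            Ugi ugi = new Ugi();                     // new identity per run
            get(ugi, new Conf());                    // caches a new Fs + Conf
            // closeAllForUgi(ugi) is skipped because the removal step threw,
            // so this entry and its Conf stay strongly reachable forever.
        }
        return CACHE.size();
    }

    public static void main(String[] args) {
        System.out.println("cached entries after failed cleanups: " + leakDemo(1000));
    }
}
```

Running the sketch shows the cache growing linearly with the number of failed cleanups, which matches the ~40K Configuration objects observed in the heap dump.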
> The memory leak seems to come from the fact that the compaction Cleaner
> attempts to
> [remove|https://github.com/apache/hive/blob/69e6a5a4151100849d2b03b6b14b1605c3abc3f1/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java#L266]
> the obsolete files and fails. The exception prevents the [filesystem
> cleanup|https://github.com/apache/hive/blob/69e6a5a4151100849d2b03b6b14b1605c3abc3f1/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java#L270]
> from taking place, so filesystem entries, and subsequently Configuration
> objects, are left in the CACHE.
> Although the HMS unresponsiveness in this use case may not be due to lack of
> memory, the leak needs to be addressed to avoid hitting OOM.
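The failure mode above, where an exception in the removal step skips the subsequent cache eviction, suggests the classic try/finally pattern as a fix direction. The sketch below is a minimal illustration with hypothetical names (`Step`, `clean`), not the actual patch in PR #3514: the filesystem eviction runs in a finally block so it executes whether or not file removal throws.

```java
// Hypothetical sketch of the fix direction: guarantee the per-UGI filesystem
// eviction runs even when the file-removal step throws, so cache entries
// (and the Configuration objects they hold) do not accumulate.
public class CleanerSketch {
    interface Step { void run() throws Exception; }

    static String clean(Step removeFiles, Step closeFileSystems) {
        try {
            removeFiles.run();                   // may throw, e.g. "Not enough history"
            return "removed";
        } catch (Exception e) {
            return "failed: " + e.getMessage();  // cleaning failed for this entry
        } finally {
            try {
                closeFileSystems.run();          // mirrors evicting cached filesystems
            } catch (Exception e) {
                // swallow: cleanup failure must not mask the original error
            }
        }
    }

    public static void main(String[] args) {
        final boolean[] evicted = {false};
        String status = clean(
            () -> { throw new java.io.IOException("Not enough history available"); },
            () -> { evicted[0] = true; });
        System.out.println(status + ", cache evicted=" + evicted[0]);
    }
}
```

With this shape, a removal failure still returns an error status, but the cache eviction observably runs, which is the invariant the real fix needs to establish.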
--
This message was sent by Atlassian Jira
(v8.20.10#820010)