Stamatis Zampetakis created HIVE-26404:
------------------------------------------
Summary: HMS memory leak when compaction cleaner fails to remove
obsolete files
Key: HIVE-26404
URL: https://issues.apache.org/jira/browse/HIVE-26404
Project: Hive
Issue Type: Bug
Components: Metastore
Affects Versions: 4.0.0-alpha-1
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
While investigating an issue where HMS becomes unresponsive we noticed a lot of
failed attempts from the compaction Cleaner thread to remove obsolete
directories with exceptions similar to the one below.
{noformat}
2022-06-16 05:48:24,819 ERROR org.apache.hadoop.hive.ql.txn.compactor.Cleaner:
[Cleaner-executor-thread-0]: Caught exception when cleaning, unable to complete
cleaning of
id:4410976,dbname:my_database,tableName:my_table,partName:day=20220502,state:,type:MAJOR,enqueueTime:0,start:0,properties:null,runAs:some_user,tooManyAborts:false,hasOldAbort:false,highestWriteId:187502,errorMessage:null
java.io.IOException: Not enough history available for (187502,x). Oldest
available base:
hdfs://nameservice1/warehouse/tablespace/managed/hive/my_database.db/my_table/day=20220502/base_0188687_v4297872
at
org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1432)
at
org.apache.hadoop.hive.ql.txn.compactor.Cleaner.removeFiles(Cleaner.java:261)
at
org.apache.hadoop.hive.ql.txn.compactor.Cleaner.access$000(Cleaner.java:71)
at
org.apache.hadoop.hive.ql.txn.compactor.Cleaner$1.run(Cleaner.java:203)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at
org.apache.hadoop.hive.ql.txn.compactor.Cleaner.clean(Cleaner.java:200)
at
org.apache.hadoop.hive.ql.txn.compactor.Cleaner.lambda$run$0(Cleaner.java:105)
at
org.apache.hadoop.hive.ql.txn.compactor.CompactorUtil$ThrowingRunnable.lambda$unchecked$0(CompactorUtil.java:54)
at
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
In addition the logs contained a large number of long JVM pauses as shown below
and the HMS (RSZ) memory kept increasing at rate of 90MB per hour.
{noformat}
2022-06-16 16:17:17,805 WARN
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 34346ms
2022-06-16 16:17:21,497 INFO
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 1690ms
2022-06-16 16:17:57,696 WARN
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 34697ms
2022-06-16 16:18:01,326 INFO
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 1628ms
2022-06-16 16:18:37,280 WARN
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 34453ms
2022-06-16 16:18:40,927 INFO
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 1646ms
2022-06-16 16:19:16,929 WARN
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 33997ms
2022-06-16 16:19:20,572 INFO
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 1637ms
2022-06-16 16:20:01,643 WARN
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 39329ms
2022-06-16 16:20:05,572 INFO
org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor:
[org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]:
Detected pause in JVM or host machine (eg GC): pause of approximately 1927ms
{noformat}
We took a heapdump of the HMS around the time that it becomes unresponsive and
we have seen many Configuration objects (~40K) occupying more than 90% of the
current heap (~9GB).
{noformat}
Class Name | Objects | Shallow Heap |
Retained Heap
----------------------------------------------------------------------------------------------
org.apache.hadoop.conf.Configuration | 39,452 | 1,893,696 | >=
8,560,573,960
java.util.concurrent.ConcurrentHashMap | 155,863 | 9,975,232 | >=
4,696,003,968
java.util.concurrent.ConcurrentHashMap$Node[]| 139,348 | 1,312,967,944 | >=
4,686,230,296
java.util.Properties | 87,119 | 4,181,712 | >=
4,193,638,904
java.util.Hashtable$Entry[] | 87,840 | 987,968,472 | >=
4,189,518,928
java.util.concurrent.ConcurrentHashMap$Node | 99,097,078 | 3,171,106,496 | >=
3,375,319,552
java.util.Hashtable$Entry | 100,047,081 | 3,201,506,592 | >=
3,201,551,936
org.postgresql.jdbc.PgConnection | 6,488 | 830,464 |
>= 551,442,952
----------------------------------------------------------------------------------------------
{noformat}
It turns out that these Configuration objects are all referenced by CACHE
entries in org.apache.hadoop.fs.FileSystem$Cache.
{noformat}
Class Name
| Shallow Heap | Retained Heap
----------------------------------------------------------------------------------------------------------------------
org.apache.hadoop.fs.FileSystem$Cache @ 0x45403fe70
| 32 | 108,671,824
|- <class> class org.apache.hadoop.fs.FileSystem$Cache @ 0x45410c3e0
| 8 | 544
'- map java.util.HashMap @ 0x453ffb598
| 48 | 92,777,232
|- <class> class java.util.HashMap @ 0x4520382c8 System Class
| 40 | 168
|- entrySet java.util.HashMap$EntrySet @ 0x454077848
| 16 | 16
'- table java.util.HashMap$Node[32768] @ 0x463585b68
| 131,088 | 92,777,168
|- class java.util.HashMap$Node[] @ 0x4520b7790
| 0 | 0
'- [1786] java.util.HashMap$Node @ 0x451998ce0
| 32 | 9,968
|- <class> class java.util.HashMap$Node @ 0x4520b7728 System Class
| 8 | 32
'- value org.apache.hadoop.hdfs.DistributedFileSystem @ 0x452990178
| 56 | 4,976
|- <class> class org.apache.hadoop.hdfs.DistributedFileSystem @
0x45402e290| 8 | 4,664
|- uri java.net.URI @ 0x451a05cd0 hdfs://nameservice1
| 80 | 432
|- dfs org.apache.hadoop.hdfs.DFSClient @ 0x451f5d9b8
| 128 | 3,824
'- conf org.apache.hadoop.hive.conf.HiveConf @ 0x453a34b38
| 80 | 250,160
----------------------------------------------------------------------------------------------------------------------
{noformat}
As long as they are in the CACHE they cannot be garbage collected so this leads
to a memory leak.
The memory leak seems to come from the fact the compaction Cleaner attempts to
[remove|https://github.com/apache/hive/blob/69e6a5a4151100849d2b03b6b14b1605c3abc3f1/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java#L266]
the obsolete files and fails. The exception does not allow the [filesystem
cleanup|https://github.com/apache/hive/blob/69e6a5a4151100849d2b03b6b14b1605c3abc3f1/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java#L270]
to take place so we are leaving filesystem entries in the CACHE and
subsequently configuration objects.
Although, the HMS unresponsiveness in this use-case may not be due to lack of
memory the leak needs to be addressed to avoid hitting OOM.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)