[jira] [Created] (SPARK-39357) pmCache memory leak caused by IsolatedClassLoader

tianshuang (Jira) Wed, 01 Jun 2022 08:41:07 -0700

tianshuang created SPARK-39357:
----------------------------------

             Summary: pmCache memory leak caused by IsolatedClassLoader
                 Key: SPARK-39357
                 URL: https://issues.apache.org/jira/browse/SPARK-39357
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1, 2.4.4
            Reporter: tianshuang
         Attachments: Xnip2022-06-01_23-09-35.jpg, 
Xnip2022-06-01_23-19-35.jpeg, Xnip2022-06-01_23-32-39.jpg


I found this bug in Spark 2.4.4. Because the related code has not changed, the 
bug still exists on master. A brief description follows:

In May 2015, 
[SPARK-6907](https://github.com/apache/spark/commit/daa70bf135f23381f5f410aa95a1c0e5a2888568)
introduced an isolated classloader for HiveMetastore to support loading 
multiple Hive versions. That change broke the 
[RawStore cleanup mechanism](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/ThreadFactoryWithGarbageCleanup.java#L27-L42).

The `ThreadWithGarbageCleanup` class used by `HiveServer2-Handler-Pool`, 
`HiveServer2-Background-Pool` and `HiveServer2-HttpHandler-Pool` is loaded by 
AppClassLoader. Its source contains the line `RawStore threadLocalRawStore = 
HiveMetaStore.HMSHandler.getRawStore();`, which reads the static 
`threadLocalMS` instance of the `HiveMetaStore.HMSHandler` class loaded by 
AppClassLoader.

During thread execution, however, the `client` is actually created by 
isolatedClassLoader. When a `RawStore` instance is obtained through 
`HiveMetaStore.HMSHandler#getMSForConf`, the `ms` instance is set on the static 
`threadLocalMS` that belongs to the `HMSHandler` class loaded by 
IsolatedClassLoader$$anon$1. In other words, the set and get methods do not 
operate on the same `threadLocalMS` instance.

As a result, the `RawStore` instance obtained in 
`ThreadWithGarbageCleanup#cacheThreadLocalRawStore` is null, and the subsequent 
`RawStore` cleanup logic does not take effect. Because the `shutdown` method of 
the `RawStore` instance is never called, the `pmCache` of 
`JDOPersistenceManagerFactory` leaks memory.
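
The mismatch described above can be reproduced outside Spark. The following is 
only a minimal sketch of the general JVM behavior, not Spark or Hive code: 
`IsolatedStaticDemo`, `Handler` and `IsolatingLoader` are hypothetical names, 
`Handler` stands in for `HMSHandler` with its static `threadLocalMS`, and 
`IsolatingLoader` is a simplified stand-in for an isolating classloader. It 
assumes the compiled class file is reachable as a resource of the application 
classloader (the usual case when compiling with `javac` and running with 
`java`, Java 9 or later).

```java
import java.io.IOException;
import java.io.InputStream;

public class IsolatedStaticDemo {

    /** Hypothetical stand-in for HMSHandler and its static threadLocalMS. */
    public static class Handler {
        private static final ThreadLocal<Object> threadLocalMS = new ThreadLocal<>();

        public static void setRawStore(Object ms) { threadLocalMS.set(ms); }

        public static Object getRawStore() { return threadLocalMS.get(); }
    }

    /** Hypothetical loader that defines its own copy of one class instead of delegating. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(ClassLoader parent, String isolatedName) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    // Re-define the class from the same bytes: a second Class object
                    // with its own, independent static fields.
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                return in.readAllBytes();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Returns {same Class object?, isolated copy sees value?, app copy sees value?}. */
    public static boolean[] run() {
        try {
            String name = Handler.class.getName();
            ClassLoader app = IsolatedStaticDemo.class.getClassLoader();
            Class<?> isolated = new IsolatingLoader(app, name).loadClass(name);

            // "set" goes through the isolated copy, like getMSForConf in the report.
            isolated.getMethod("setRawStore", Object.class).invoke(null, new Object());

            Object seenByIsolated = isolated.getMethod("getRawStore").invoke(null);
            // "get" goes through the app-loader copy, like cacheThreadLocalRawStore.
            Object seenByApp = Handler.getRawStore();

            return new boolean[] {
                isolated == Handler.class, seenByIsolated != null, seenByApp != null
            };
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("same Class object: " + r[0]);
        System.out.println("isolated copy sees RawStore: " + r[1]);
        System.out.println("app-loader copy sees RawStore: " + r[2]);
    }
}
```

Both calls run on the same thread, yet the value set through the isolated copy 
should not be visible through the application-loader copy, because each 
`Class` object owns a separate static `ThreadLocal`.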

Long-running Spark ThriftServer instances end up with frequent GCs, resulting 
in poor performance.

I analyzed the heap dump using MAT and executed the following OQL: `SELECT * 
FROM INSTANCEOF java.lang.Class c WHERE [email protected]("class 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler ")`. Two instances of 
the `HMSHandler` **Class** can be found in the heap, and each of them holds its 
own static `threadLocalMS` instance.

Executing the OQL `select * from 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory` shows that the `pmCache` 
of the `JDOPersistenceManagerFactory` instance occupies 1.3GB of memory.

Executing the OQL `SELECT * FROM INSTANCEOF java.lang.Class c WHERE 
[email protected]("class 
org.apache.hive.service.server.ThreadFactoryWithGarbageCleanup")` shows that 
the static `threadRawStoreMap` of `ThreadFactoryWithGarbageCleanup` contains no 
elements. This confirms the analysis above: `HMSHandler.getRawStore()` in 
`ThreadWithGarbageCleanup#cacheThreadLocalRawStore` reads the `threadLocalMS` 
instance of the `HMSHandler` loaded by AppClassLoader instead of the 
`threadLocalMS` instance of the `HMSHandler` loaded by 
IsolatedClassLoader$$anon$1.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

