[
https://issues.apache.org/jira/browse/SPARK-39357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
tianshuang updated SPARK-39357:
-------------------------------
Description:
I found this bug in Spark 2.4.4; since the related code has not changed, the
bug still exists on master. A brief description follows:
In May 2015,
[SPARK-6907|https://github.com/apache/spark/commit/daa70bf135f23381f5f410aa95a1c0e5a2888568]
introduced an isolated classloader for the Hive metastore to support loading
multiple Hive versions. However, that change broke the [RawStore cleanup
mechanism|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/ThreadFactoryWithGarbageCleanup.java#L27-L42]:
the `ThreadWithGarbageCleanup` class used by the `HiveServer2-Handler-Pool`,
`HiveServer2-Background-Pool`, and `HiveServer2-HttpHandler-Pool` thread pools
is loaded by the AppClassLoader. In `ThreadWithGarbageCleanup`, the line
`RawStore threadLocalRawStore = HiveMetaStore.HMSHandler.getRawStore();` reads
the static `threadLocalMS` field of the `HiveMetaStore.HMSHandler` class loaded
by the AppClassLoader. During thread execution, however, the metastore `client`
is actually created by the isolated classloader, so when
`HiveMetaStore.HMSHandler#getMSForConf` obtains a `RawStore` instance (`ms`)
and stores it in `threadLocalMS`, that static field belongs to the `HMSHandler`
class loaded by `IsolatedClassLoader$$anon$1`. In other words, the set and the
get operate on two different `threadLocalMS` fields. As a result,
`ThreadWithGarbageCleanup#cacheThreadLocalRawStore` always obtains a null
`RawStore`, the subsequent `RawStore` cleanup logic never runs, the `shutdown`
method of the `RawStore` instance is never called, and the `pmCache` of
`JDOPersistenceManagerFactory` leaks memory. A long-running Spark ThriftServer
therefore ends up with frequent GCs and poor performance.
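The root cause, one class defined twice by two classloaders so that each copy
carries its own static fields, can be reproduced outside Spark. The following
is a minimal sketch; `Holder` and `IsolatingLoader` are hypothetical stand-ins
invented for illustration, not Spark or Hive classes. Here the `set` happens
through the AppClassLoader copy and the read through the isolated copy (the
mirror image of the bug, where the isolated copy is written and the
AppClassLoader copy is read), but the effect is the same:

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Field;

// Stand-in for HiveMetaStore.HMSHandler with its static threadLocalMS field.
class Holder {
    static final ThreadLocal<String> threadLocalMS = new ThreadLocal<>();
}

// Child-first loader standing in for Spark's IsolatedClassLoader: it
// re-defines Holder from its class file instead of delegating to the parent.
class IsolatingLoader extends ClassLoader {
    IsolatingLoader(ClassLoader parent) { super(parent); }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (!name.equals("Holder")) {
            return super.loadClass(name, resolve); // delegate everything else
        }
        try (InputStream in = getParent().getResourceAsStream("Holder.class")) {
            if (in == null) throw new ClassNotFoundException(name + " (compile with javac first)");
            byte[] bytes = in.readAllBytes();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        // Set the ThreadLocal through the AppClassLoader's copy of the class.
        Holder.threadLocalMS.set("app-loader-value");

        // Load "the same" class again through the isolating loader.
        Class<?> isolated = new IsolatingLoader(Holder.class.getClassLoader()).loadClass("Holder");
        System.out.println(isolated == Holder.class); // false: two distinct Class objects

        // The isolated copy has its own static threadLocalMS, which never saw the set().
        Field f = isolated.getDeclaredField("threadLocalMS");
        f.setAccessible(true);
        ThreadLocal<?> isolatedTl = (ThreadLocal<?>) f.get(null);
        System.out.println(isolatedTl.get()); // null
    }
}
```

Compiled with `javac Demo.java` and run with `java Demo`, this prints `false`
and then `null`: each `Class` object has an independent `threadLocalMS`, so a
value set through one copy is invisible to the other.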
I analyzed the heap dump using MAT. Running the OQL query `SELECT * FROM
INSTANCEOF java.lang.Class c WHERE [email protected]("class
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler ")` finds two
`HMSHandler` `Class` objects in the heap, each holding its own static
`threadLocalMS` field.
Running `select * from org.datanucleus.api.jdo.JDOPersistenceManagerFactory`
shows that the `pmCache` of the `JDOPersistenceManagerFactory` instance
occupies 1.3 GB of memory.
Running `SELECT * FROM INSTANCEOF java.lang.Class c WHERE
[email protected]("class
org.apache.hive.service.server.ThreadFactoryWithGarbageCleanup")` shows that
the static `threadRawStoreMap` of `ThreadFactoryWithGarbageCleanup` contains
no elements, which confirms the analysis above: `HMSHandler.getRawStore()` in
`ThreadWithGarbageCleanup#cacheThreadLocalRawStore` reads the `threadLocalMS`
field of the `HMSHandler` loaded by the AppClassLoader, not the one in the
`HMSHandler` loaded by `IsolatedClassLoader$$anon$1`.
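Putting the pieces together, the broken cleanup path can be sketched with
stand-in types. `RawStoreStub`, `AppLoaderHandler`, `IsolatedLoaderHandler`,
and `CleanupSketch` are hypothetical names for illustration only; the two
handler classes simulate the two copies of `HMSHandler` without actually using
two classloaders:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for Hive's RawStore.
interface RawStoreStub { void shutdown(); }

// Simulates the HMSHandler copy that the cleanup code reads from.
class AppLoaderHandler {
    static final ThreadLocal<RawStoreStub> threadLocalMS = new ThreadLocal<>();
    static RawStoreStub getRawStore() { return threadLocalMS.get(); }
}

// Simulates the HMSHandler copy that the metastore client actually writes to.
class IsolatedLoaderHandler {
    static final ThreadLocal<RawStoreStub> threadLocalMS = new ThreadLocal<>();
}

public class CleanupSketch {
    static final Map<Long, RawStoreStub> threadRawStoreMap = new ConcurrentHashMap<>();

    // Mirrors ThreadWithGarbageCleanup#cacheThreadLocalRawStore.
    static void cacheThreadLocalRawStore() {
        RawStoreStub rs = AppLoaderHandler.getRawStore(); // always null here
        if (rs != null) {
            threadRawStoreMap.put(Thread.currentThread().getId(), rs);
        }
    }

    public static void main(String[] args) {
        // The client stores the RawStore in the *isolated* copy's statics...
        IsolatedLoaderHandler.threadLocalMS.set(() -> {});
        cacheThreadLocalRawStore();
        // ...so the cleanup map stays empty and shutdown() is never called,
        // matching the empty threadRawStoreMap seen in the heap dump.
        System.out.println(threadRawStoreMap.isEmpty()); // true
    }
}
```

This reproduces exactly what the third OQL query shows: the write and the read
touch different `threadLocalMS` fields, `threadRawStoreMap` stays empty, and
nothing ever calls `shutdown()` on the leaked `RawStore`.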
> pmCache memory leak caused by IsolatedClassLoader
> -------------------------------------------------
>
> Key: SPARK-39357
> URL: https://issues.apache.org/jira/browse/SPARK-39357
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 3.2.1
> Reporter: tianshuang
> Priority: Major
> Attachments: Xnip2022-06-01_23-09-35.jpg,
> Xnip2022-06-01_23-19-35.jpeg, Xnip2022-06-01_23-32-39.jpg
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)