[ 
https://issues.apache.org/jira/browse/FLINK-15024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingjie Cao updated FLINK-15024:
--------------------------------
    Description: 
We run a Flink session cluster as a service for ad-hoc queries. After 
running some queries, we found that the memory usage of the TaskManager keeps 
growing and the memory cannot be garbage collected. Eventually, we found that 
the entries (class name and lock object) in the parallelLockMap of 
AppClassLoader and ExtClassLoader cannot be recycled. The classes in question 
are generated ones and should never be loaded by the system classloader.

The codegen classes are loaded by org.codehaus.janino.ByteArrayClassLoader, 
which is a parent-first classloader: it first asks its parent classloader, 
e.g. the Flink user classloader, to load the class. The Flink user classloader 
in turn tries its own parent classloader, so the request finally reaches 
AppClassLoader and ExtClassLoader. Both AppClassLoader and ExtClassLoader are 
SecureClassLoaders and add the class name and a lock object to the 
parallelLockMap when loading a new class.
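
The delegation chain described above can be illustrated with a minimal sketch. 
It does not use Janino or Flink and does not inspect parallelLockMap; the names 
DelegationDemo, RecordingLoader and GeneratedClass$42 are hypothetical and only 
show that a parent-first child forwards an unknown generated class name to its 
parent before giving up.

```java
import java.util.ArrayList;
import java.util.List;

public class DelegationDemo {

    /** Hypothetical stand-in for a parent loader: records every requested name. */
    static class RecordingLoader extends ClassLoader {
        final List<String> requested = new ArrayList<>();

        RecordingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            requested.add(name);                   // the name reaches this loader
            return super.loadClass(name, resolve); // and is passed further up
        }
    }

    /** Asks a parent-first child for a class and returns what the parent saw. */
    static List<String> namesSeenByParent(String generatedName) {
        RecordingLoader parent =
                new RecordingLoader(ClassLoader.getSystemClassLoader());
        // A plain ClassLoader subclass delegates to its parent first.
        ClassLoader child = new ClassLoader(parent) {};
        try {
            child.loadClass(generatedName);
        } catch (ClassNotFoundException expected) {
            // No loader in the chain knows the generated class.
        }
        return parent.requested;
    }

    public static void main(String[] args) {
        System.out.println(namesSeenByParent("GeneratedClass$42"));
    }
}
```

The lookup fails, but the parent has still been asked for the name, which is 
the step that leaves an entry behind in a parallel-capable loader.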

I think we should never let the system classloader try to load the generated 
classes, since that attempt is doomed to fail. We need to prune the loading 
process for codegen classes so that those classes never reach the system 
classloader. Two ways can achieve that:
 # Give the codegen class names a special prefix and filter classes with that 
prefix in the Flink user classloader.
 # Implement a new child-first classloader that filters the codegen classes 
and never loads them with the Flink user classloader, and set this 
classloader as the parent of org.codehaus.janino.ByteArrayClassLoader instead 
of the Flink user classloader.
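
The filtering idea behind both options could look roughly like the following 
sketch. It is not Flink code: the prefix "codegen$" and the classes 
CodegenFilterDemo, FilteringLoader and RecordingLoader are assumptions made for 
illustration. The filter rejects names with the prefix before delegating, so 
they never travel further up the chain.

```java
import java.util.ArrayList;
import java.util.List;

public class CodegenFilterDemo {

    /** Hypothetical prefix; the issue does not say which prefix would be used. */
    static final String CODEGEN_PREFIX = "codegen$";

    /** Records every name that gets past the filter. */
    static class RecordingLoader extends ClassLoader {
        final List<String> requested = new ArrayList<>();

        RecordingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            requested.add(name);
            return super.loadClass(name, resolve);
        }
    }

    /** Refuses generated class names instead of delegating them upwards. */
    static class FilteringLoader extends ClassLoader {
        FilteringLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (name.startsWith(CODEGEN_PREFIX)) {
                // Fail fast so no parent loader is ever asked for this name.
                throw new ClassNotFoundException(name);
            }
            return super.loadClass(name, resolve);
        }
    }

    /** Returns true if a request for the name reaches the loader above the filter. */
    static boolean reachesParent(String name) {
        RecordingLoader parent =
                new RecordingLoader(ClassLoader.getSystemClassLoader());
        ClassLoader filter = new FilteringLoader(parent);
        // Stand-in for the parent-first loader that defines the generated classes.
        ClassLoader child = new ClassLoader(filter) {};
        try {
            child.loadClass(name);
        } catch (ClassNotFoundException ignored) {
            // Expected for names that no loader in the chain can resolve.
        }
        return parent.requested.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(reachesParent("codegen$Foo"));
        System.out.println(reachesParent("java.lang.String"));
    }
}
```

In this sketch ordinary class names are still delegated as usual; only names 
with the assumed prefix are cut off at the filter.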

  was:
We are using Flink session cluster as a service for ad-hoc queries. After 
running some queries, we found that the memory usage of TaskManager grows and 
cannot be garbage collected. Eventually, we found that it was the object (class 
name and lock object) in ```parallelLockMap``` of ```AppClassloader``` and 
```ExtClassloader``` cannot be recycled. And we found the classes were 
generated ones and should be never loaded by system classloader.

The codegen classes are loaded by 
```org.codehaus.janino.ByteArrayClassLoader``` which is a parent first 
classloader and will rely  on its parent classloader, e.g. Flink user 
classloader to load the class first, flink user classloader will also try to 
load the class with its parent classloader, and finally it will reach 
```AppClassloader``` and ```ExtClassloader```. Both the ```AppClassloader``` 
and ```ExtClassloader``` are ```SecureClassLoader``` and will add class name 
and a lock object to the ```parallelLockMap``` when loading a new class.

I think we should never let the system classloader try to load the generated 
classes which is doomed to fail. We need to prune the process of loading 
codegen classes and avoid those classes reaching the system classloader. Two 
ways can achieve that:
 # We give a special prefix to codegen class name and filter class with those 
prefix in Flink user classloader.
 # We implement a new child first classloader which filters the codegen class 
and never loads the codegen class with Flink user classloader and set this 
class loader as the parent classloader of 
```org.codehaus.janino.ByteArrayClassLoader``` instead of the Flink user 
classloader.


> System classloader memory leak after loading too many codegen classes.
> ----------------------------------------------------------------------
>
>                 Key: FLINK-15024
>                 URL: https://issues.apache.org/jira/browse/FLINK-15024
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Yingjie Cao
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)