[ https://issues.apache.org/jira/browse/FLINK-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998925#comment-16998925 ]

Rui Li commented on FLINK-15239:
--------------------------------

I tried moving the Hadoop dependencies to the parent class loader. While that 
fixes the thread leak, the TM still hits a metaspace OOM after several (more) 
queries are executed. Checking the heap dump, I found the child class loaders 
are retained by {{CompileUtils::COMPILED_CACHE}}. Although this cache has a 
maximum size, it can still take a lot of space because each retained loader 
keeps many generated classes alive. I tried making the cache use weak/soft 
references and verified that it fixes the OOM in my local environment.
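
For reference, here is a minimal sketch of a soft-valued compiled-class cache 
built with Guava's {{CacheBuilder}}. The class name, key scheme, and size limit 
are illustrative only and not the actual {{CompileUtils}} change; the point is 
that soft values let the GC reclaim the generated classes (and, transitively, 
the child class loader that defined them) under memory pressure:
{noformat}
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.concurrent.ExecutionException;

public final class CompiledClassCacheSketch {

    // Soft values allow the GC to drop compiled classes (and the child
    // classloader that defined them) under memory pressure, instead of
    // pinning them until size-based eviction happens to remove the entry.
    private static final Cache<String, Class<?>> COMPILED_CACHE =
            CacheBuilder.newBuilder()
                    .maximumSize(100) // illustrative limit
                    .softValues()
                    .build();

    private CompiledClassCacheSketch() {
    }

    public static Class<?> getOrCompile(ClassLoader classLoader, String name, String code)
            throws ExecutionException {
        // Include the classloader identity in the key so classes generated for
        // different jobs (different child classloaders) do not collide.
        String key = System.identityHashCode(classLoader) + "#" + name;
        return COMPILED_CACHE.get(key, () -> compile(classLoader, name, code));
    }

    private static Class<?> compile(ClassLoader classLoader, String name, String code) {
        // Placeholder for the actual Janino-based compilation step.
        throw new UnsupportedOperationException("compile " + name);
    }
}
{noformat}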

> TM Metaspace memory leak
> ------------------------
>
>                 Key: FLINK-15239
>                 URL: https://issues.apache.org/jira/browse/FLINK-15239
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Runtime
>    Affects Versions: 1.10.0
>            Reporter: Rui Li
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Start a standalone cluster and then submit multiple queries against Hive tables 
> via the SQL CLI. The Hive connector dependencies are specified via the {{library}} 
> option. The TM eventually fails with:
> {noformat}
> 2019-12-13 15:11:03,698 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Values(tuples=[[{ 4.3 }]], values=[EXPR$0]) -> SinkConversionToRow -> Sink: Unnamed (1/1) (b9f9667f686fd97c1c5af65b8b163c44) switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
>         at java.lang.ClassLoader.defineClass1(Native Method)
>         at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>         at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:439)
>         at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:324)
>         at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:687)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>         at org.apache.flink.connectors.hive.HadoopFileSystemFactory.create(HadoopFileSystemFactory.java:46)
>         at org.apache.flink.table.filesystem.PartitionTempFileManager.<init>(PartitionTempFileManager.java:73)
>         at org.apache.flink.table.filesystem.FileSystemOutputFormat.open(FileSystemOutputFormat.java:104)
>         at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:65)
>         at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
>         at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
>         at org.apache.flink.streaming.api.operators.StreamSink.open(StreamSink.java:48)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1018)
> {noformat}
> Even for the queries that succeed, the TM prints the following errors:
> {noformat}
> Exception in thread "LeaseRenewer:lirui@localhost:8020" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/LeaseRenewer$2
>         at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:412)
>         at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
>         at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
>         at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.LeaseRenewer$2
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         ... 5 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)