[ https://issues.apache.org/jira/browse/FLINK-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998925#comment-16998925 ]
Rui Li commented on FLINK-15239:
--------------------------------

I tried moving the Hadoop dependencies to the parent class loader. While that solves the thread leak, the TM still hits a Metaspace OOM after several (more) queries are executed. Checking the heap dump, I found that the child class loaders are retained by {{CompileUtils::COMPILED_CACHE}}. Although this cache has a maximum size limit, it can still take a lot of space because each loader can hold many class instances. I tried making the cache use weak/soft references and verified that this solves the OOM in my local env.

> TM Metaspace memory leak
> ------------------------
>
>                 Key: FLINK-15239
>                 URL: https://issues.apache.org/jira/browse/FLINK-15239
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Runtime
>    Affects Versions: 1.10.0
>            Reporter: Rui Li
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Start a standalone cluster and then submit multiple queries for Hive tables
> via SQL CLI. Hive connector dependencies are specified via the {{library}}
> option. TM will fail eventually with:
> {noformat}
> 2019-12-13 15:11:03,698 INFO  org.apache.flink.runtime.taskmanager.Task - Source: Values(tuples=[[{ 4.3 }]], values=[EXPR$0]) -> SinkConversionToRow -> Sink: Unnamed (1/1) (b9f9667f686fd97c1c5af65b8b163c44) switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:439)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:324)
>     at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:687)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>     at org.apache.flink.connectors.hive.HadoopFileSystemFactory.create(HadoopFileSystemFactory.java:46)
>     at org.apache.flink.table.filesystem.PartitionTempFileManager.<init>(PartitionTempFileManager.java:73)
>     at org.apache.flink.table.filesystem.FileSystemOutputFormat.open(FileSystemOutputFormat.java:104)
>     at org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction.open(OutputFormatSinkFunction.java:65)
>     at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
>     at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
>     at org.apache.flink.streaming.api.operators.StreamSink.open(StreamSink.java:48)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1018)
> {noformat}
> Even for the succeeded queries, TM prints the following errors:
> {noformat}
> Exception in thread "LeaseRenewer:lirui@localhost:8020" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/LeaseRenewer$2
>     at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:412)
>     at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
>     at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
>     at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.LeaseRenewer$2
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     ... 5 more
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
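The weak/soft-reference fix described in the comment can be illustrated with a minimal sketch. This is hypothetical code, not Flink's actual {{CompileUtils}} implementation: the class name {{SoftCompiledClassCache}}, the key scheme, and the API are all invented for illustration. The point it demonstrates is that holding generated classes through soft references lets the GC clear cache entries under memory pressure, which in turn releases the user-code class loader (and the Metaspace it occupies) that each cached class would otherwise pin.

```java
import java.lang.ref.SoftReference;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch -- NOT Flink's actual CompileUtils. A compiled-class
// cache whose values are SoftReferences: a Class<?> that is no longer
// strongly reachable elsewhere may be reclaimed under memory pressure,
// and with it the ClassLoader that defined it.
public class SoftCompiledClassCache {

    // Key combines the defining classloader's identity with the code text,
    // so the same source compiled under different loaders gets distinct entries.
    private final ConcurrentMap<String, SoftReference<Class<?>>> cache =
            new ConcurrentHashMap<>();

    private static String key(ClassLoader cl, String code) {
        return System.identityHashCode(cl) + "#" + Objects.hashCode(code);
    }

    public Class<?> get(ClassLoader cl, String code) {
        SoftReference<Class<?>> ref = cache.get(key(cl, code));
        // null if never cached, or if the GC has cleared the soft reference.
        return ref == null ? null : ref.get();
    }

    public void put(ClassLoader cl, String code, Class<?> clazz) {
        cache.put(key(cl, code), new SoftReference<>(clazz));
    }

    public static void main(String[] args) {
        SoftCompiledClassCache cache = new SoftCompiledClassCache();
        ClassLoader cl = SoftCompiledClassCache.class.getClassLoader();
        // Stand-in for a generated class; any Class<?> works for the demo.
        cache.put(cl, "public class Gen {}", String.class);
        // While the class is strongly reachable, lookups hit the cache.
        System.out.println(cache.get(cl, "public class Gen {}") == String.class);
    }
}
```

A size-capped strong-reference cache, by contrast, keeps every cached class (and its loader) alive until eviction, which is why a small entry limit alone did not bound Metaspace usage here: each retained loader can hold many class instances.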