Gen Luo created FLINK-29927:
-------------------------------

             Summary: AkkaUtils#getAddress may cause memory leak
                 Key: FLINK-29927
                 URL: https://issues.apache.org/jira/browse/FLINK-29927
             Project: Flink
          Issue Type: Bug
            Reporter: Gen Luo
         Attachments: 截屏2022-11-08 下午5.19.38.png

We found a slow memory leak in the JobManager (JM). When MetricFetcherImpl 
retrieves metrics, it always calls MetricQueryServiceRetriever#retrieveService 
first. That method acquires the address of a task manager, which uses 
AkkaUtils#getAddress internally. The getAddress method is implemented like 
this:

{code:java}
    public static Address getAddress(ActorSystem system) {
        return new RemoteAddressExtension().apply(system).getAddress();
    }
{code}

and the RemoteAddressExtension#apply is like this:

{code:scala}
  def apply(system: ActorSystem): T = {
    java.util.Objects.requireNonNull(system, "system must not be null!").registerExtension(this)
  }
{code}

Since getAddress creates a new RemoteAddressExtension each time, every call of 
AkkaUtils#getAddress registers a new extension with the ActorSystem, and that 
extension can never be released until the ActorSystem exits.
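
To illustrate the mechanism, here is a minimal, self-contained sketch. It does 
not use Akka at all: ExtensionId and Registry below are hypothetical stand-ins, 
and the sketch assumes the real registry keys extensions by the identity of the 
extension id instance. Under that assumption, passing a fresh id on every call 
grows the registry without bound, while reusing one shared instance does not.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ExtensionLeakSketch {

    /** Hypothetical stand-in for an extension id; relies on identity equals/hashCode. */
    static final class ExtensionId {
        static final ExtensionId INSTANCE = new ExtensionId();
    }

    /** Hypothetical stand-in for the actor system's extension registry. */
    static final class Registry {
        private final Map<ExtensionId, Object> extensions = new ConcurrentHashMap<>();

        /** Registers the extension for this id unless the same id instance is already present. */
        Object register(ExtensionId id) {
            return extensions.computeIfAbsent(id, k -> new Object());
        }

        int size() {
            return extensions.size();
        }
    }

    /** Simulates {@code calls} invocations of getAddress and returns the registry size. */
    static int registeredAfter(int calls, boolean reuseId) {
        Registry registry = new Registry();
        for (int i = 0; i < calls; i++) {
            // reuseId == false mirrors "new RemoteAddressExtension()" on every call.
            registry.register(reuseId ? ExtensionId.INSTANCE : new ExtensionId());
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("fresh id per call: " + registeredAfter(1000, false)); // 1000
        System.out.println("shared id:         " + registeredAfter(1000, true));  // 1
    }
}
```

If the assumption holds, one possible direction for a fix would be to keep a 
single shared RemoteAddressExtension instance instead of creating one per call.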

Most callers invoke the method only once during initialization, but as 
described above, MetricFetcherImpl also uses it. This happens periodically 
while users have the WebUI open, or whenever users call the REST API directly 
to get metrics. This means the memory may keep leaking.

The leak may have been introduced in FLINK-23662 when porting the Scala version 
of AkkaUtils to Java, though I'm not sure whether the Scala version has the 
same issue.

The leak seems very slow. We observed it on a job that had been running for 
more than one month with only 1G of memory for the job manager. So I suppose 
it's not urgent, but it still needs to be fixed.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
