Xun REN created TEZ-4070:
----------------------------

             Summary: SSLFactory not closed in DAGClientTimelineImpl caused 
native memory issues
                 Key: TEZ-4070
                 URL: https://issues.apache.org/jira/browse/TEZ-4070
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Xun REN


Hi,

Recently, we have been facing native memory issues on Redhat servers. The issue 
completely crashed our servers.

*Context:*

- HDP-2.6.5 

- Redhat 7.4

*Problem:*

After upgrading from HDP-2.6.2 to HDP-2.6.5, our HiveServer2 can consume more 
than 100GB of memory after several days of running, even though Xmx is set to 
20GB and MaxMetaspaceSize to 10GB.

After searching, we found a similar issue:

https://issues.apache.org/jira/browse/YARN-5309

That issue was fixed in the hadoop-common module, and our version already 
includes the fix; however, we still have the problem.

Further investigation showed that in the Tez class 
org.apache.tez.dag.api.client.TimelineReaderFactory, an SSLFactory is created 
when HTTPS is used for YARN, but it is never destroyed after use.

TimelineReaderFactory is referenced in the class DAGClientTimelineImpl.
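
For illustration, the leaky pattern looks roughly like this (a hedged sketch, 
not the exact Tez source; the helper method name is made up). SSLFactory.init() 
is what starts the "Truststore reloader thread", and nothing ever calls 
destroy():
{code:java}
// Sketch of the leaky pattern (simplified, not the exact Tez code):
// SSLFactory.init() starts a "Truststore reloader thread"; without a
// matching destroy(), that thread outlives the HTTP client using it.
import java.io.IOException;
import java.security.GeneralSecurityException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;

class SslFactoryLeakSketch {
  static SSLFactory newClientSslFactory(Configuration conf)
      throws IOException, GeneralSecurityException {
    SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, conf);
    sslFactory.init(); // spawns the reloader thread
    return sslFactory; // no caller ever invokes sslFactory.destroy()
  }
}{code}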

When ATS is used and a DAG completes, the method switchToTimelineClient in the 
class DAGClientImpl is called. It closes the previous HTTP client, but not the 
SSLFactory inside it. Since the SSLFactory creates a background thread for each 
connection, we eventually end up with thousands of threads consuming a lot of 
native memory.
{code:java}
private void switchToTimelineClient() throws IOException, TezException {
  realClient.close();
  realClient = new DAGClientTimelineImpl(appId, dagId, conf, frameworkClient,
      (int) (2 * PRINT_STATUS_INTERVAL_MILLIS));
  if (LOG.isDebugEnabled()) {
    LOG.debug("dag completed switching to DAGClientTimelineImpl");
  }
}{code}
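One possible shape of a fix, sketched under assumptions: if 
DAGClientTimelineImpl kept a reference to the SSLFactory it (indirectly) 
created through TimelineReaderFactory, its close() could destroy it. The 
sslFactory field below is an assumption for illustration, not an actual Tez 
member; the httpClient cleanup mirrors the existing Jersey client shutdown.
{code:java}
// Sketch only: destroy the SSLFactory when the timeline client is closed,
// which stops its "Truststore reloader thread".
@Override
public void close() throws IOException {
  if (httpClient != null) {
    httpClient.destroy(); // shut down the Jersey HTTP client as today
    httpClient = null;
  }
  if (sslFactory != null) { // hypothetical field holding the created factory
    sslFactory.destroy();   // the missing step: stops the reloader thread
    sslFactory = null;
  }
}{code}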
I checked another environment that is still on HDP-2.6.2: it also has many 
running threads held by SSLFactory. So the problem already exists there and is 
merely amplified in HDP-2.6.5.

*How to reproduce the problem:*

1. Use Tez as Hive execution engine

2. Launch a Beeline session for Hive

3. Do a select with a simple where clause on a table

4. Repeat steps 2-3 to open several distinct connections (as is typical on a 
shared cluster with multiple clients).

Finally, a thread dump will show many threads named "Truststore reloader 
thread", and the "top" or "ps" commands will show very high native memory 
usage.
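
To quantify the leak from inside the affected JVM, a tiny probe like the 
following can count the reloader threads (a sketch; in practice you would run 
this inside HiveServer2, e.g. from an attached agent, or simply run jstack on 
the process and grep for the thread name):
{code:java}
// Minimal sketch: count live "Truststore reloader thread" instances in
// the current JVM. Each leaked SSLFactory contributes one such thread.
public class CountReloaderThreads {
  public static void main(String[] args) {
    long reloaders = Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().contains("Truststore reloader thread"))
        .count();
    System.out.println("Truststore reloader threads: " + reloaders);
  }
}{code}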

 


