[
https://issues.apache.org/jira/browse/HDFS-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121518#comment-14121518
]
Daryn Sharp commented on HDFS-6967:
-----------------------------------
This is becoming a critical issue. Wide and/or numerous jobs overwhelm DNs (due
to their smaller heaps) and sometimes the NN. Either can OOM or fall into full
GC. In the case of DNs, GC can run so long that the NN declares the node dead.
Hadoop's Jetty version does not support limiting connections; in fact, the code
contains "README" comments, buried in the bowels of private/anonymous classes,
stating that connections should be limited. Overriding Jetty's behavior would
require seriously risky hacks.
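By contrast, a bounded server rejects excess work outright rather than queuing it without limit. A minimal stdlib sketch of that behavior (illustrative only; this is not Hadoop or Jetty code) uses a ThreadPoolExecutor with a bounded queue and AbortPolicy:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedAcceptSketch {
    // Simulate 10 near-simultaneous requests against a pool with
    // 2 worker threads and a queue capped at 2 entries.
    static int[] simulate() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),            // bounded queue
                new ThreadPoolExecutor.AbortPolicy());  // reject instead of queuing forever
        int accepted = 0, rejected = 0;
        for (int i = 0; i < 10; i++) {
            try {
                pool.execute(() -> {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                });
                accepted++;
            } catch (RejectedExecutionException e) {
                rejected++; // the client would retry against another DN
            }
        }
        pool.shutdownNow();
        return new int[] { accepted, rejected };
    }

    public static void main(String[] args) {
        int[] r = simulate();
        // 2 tasks run + 2 queued are accepted; the other 6 are rejected.
        System.out.println("accepted=" + r[0] + " rejected=" + r[1]);
    }
}
```

The point is the failure mode: a rejected client fails fast and retries elsewhere, instead of parking a "heavy" connection in an unbounded queue.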
I'd like to solicit comments on an internal approach we are trying: rather
than redirecting the client to a DN holding a replica, redirect it to a random
DN in a rack that holds a replica. This distributes the Jetty connections,
DFSClients, and associated objects over many more nodes, with an insignificant
impact on throughput - especially over higher-latency links, and compared to
the time it takes for the client to time out the connection and retry another
node, which may also be overloaded.
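The selection described above can be sketched as follows; the rack-map data model and all names are assumptions for illustration, not HDFS's actual NetworkTopology or BlockManager APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class RackLocalRedirect {
    // Hypothetical sketch: pick a random DN from any rack that holds a
    // replica, not necessarily a DN that holds the replica itself.
    static String chooseRedirectTarget(Map<String, List<String>> rackToNodes,
                                       Set<String> replicaHolders,
                                       Random rng) {
        // Candidates are every node in every rack containing a replica.
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, List<String>> rack : rackToNodes.entrySet()) {
            for (String node : rack.getValue()) {
                if (replicaHolders.contains(node)) {
                    candidates.addAll(rack.getValue());
                    break;
                }
            }
        }
        // Random choice spreads the Jetty connections across the whole rack;
        // rack-local reads keep the throughput cost small.
        return candidates.get(rng.nextInt(candidates.size()));
    }
}
```

For example, if /rack1 = {dn1, dn2, dn3} and only dn1 holds a replica, the redirect target may be any of dn1, dn2, or dn3, tripling the set of nodes absorbing connections for that block.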
> DNs may OOM under high webhdfs load
> -----------------------------------
>
> Key: HDFS-6967
> URL: https://issues.apache.org/jira/browse/HDFS-6967
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, webhdfs
> Affects Versions: 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Assignee: Eric Payne
>
> Webhdfs uses jetty. The size of the request thread pool is limited, but
> jetty will accept and queue an unbounded number of connections. Every queued
> connection is "heavy" with buffers, etc. Unlike data streamer connections,
> thousands of webhdfs connections will quickly OOM a DN. The accepted requests
> must be bounded and excess clients rejected so they retry on a new DN.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)