[
https://issues.apache.org/jira/browse/HDFS-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122937#comment-14122937
]
Daryn Sharp commented on HDFS-6967:
-----------------------------------
Popular in the sense that wide jobs needed access to ~3 replicas, yes. Jobs
with 2-5k mappers that ran fine with hdfs caused significant problems with
webhdfs. We saw up to 20k jetty requests queued up while the DN was in the
throes of GC. Presumably the clients were timing out and retrying the
connection, which further exacerbates the memory problem because the jetty
queue fills with heavy objects for dead connections.
We could take the existing DN load into account, such that we pick the
lightest xceiver-loaded DN before picking a random rack-local one. However,
the risk is that an onslaught of tasks will be distributed to the DNs holding
the replica before those DNs can report their spikes in load. This is already
a minor issue for hdfs. In the past we've considered using a "predictive"
load, where the NN artificially increases a DN's load stats as it gives out
locations. In the end, we chose to keep it simple with random rack-local.
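The "predictive" load idea above can be sketched roughly as follows. This is a hypothetical illustration, not HDFS code: the class name, the per-DN load map, and the one-unit increment are all assumptions made for the example. The point is only that the NN charges a DN for each location it hands out, so a burst of requests between heartbeats is spread instead of all landing on the currently lightest node.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "predictive" load-aware replica selection.
class PredictiveSelector {
    // load reported at the last heartbeat, plus artificial increments
    private final Map<String, Integer> load = new HashMap<>();

    // a heartbeat resets the entry to the DN's real xceiver count
    void reportLoad(String dn, int xceivers) {
        load.put(dn, xceivers);
    }

    // pick the replica holder with the lowest predicted load, then
    // artificially charge it one unit for the location just given out
    String pick(List<String> replicaHolders) {
        String best = replicaHolders.stream()
            .min(Comparator.comparingInt(dn -> load.getOrDefault(dn, 0)))
            .orElseThrow();
        load.merge(best, 1, Integer::sum);
        return best;
    }
}
```

With this scheme, two back-to-back requests for the same block are steered to different replica holders even though no new heartbeat has arrived in between.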
> DNs may OOM under high webhdfs load
> -----------------------------------
>
> Key: HDFS-6967
> URL: https://issues.apache.org/jira/browse/HDFS-6967
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, webhdfs
> Affects Versions: 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Assignee: Eric Payne
>
> Webhdfs uses jetty. The size of the request thread pool is limited, but
> jetty will accept and queue infinite connections. Every queued connection is
> "heavy" with buffers, etc. Unlike data streamer connections, thousands of
> webhdfs connections will quickly OOM a DN. The accepted requests must be
> bounded and excess clients rejected so they retry on a new DN.
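The fix the report asks for amounts to bounding both the worker pool and the accept queue, and rejecting overflow instead of queuing unbounded heavy connections. A minimal sketch using the standard java.util.concurrent API (the class name, pool size of 4, and queue depth of 16 are illustrative assumptions, not the actual DN configuration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: bounded workers + bounded queue, reject the rest.
class BoundedRequestPool {
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 4,                                  // fixed worker threads (illustrative)
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(16),          // bounded accept queue (illustrative)
        new ThreadPoolExecutor.AbortPolicy()); // reject instead of queuing forever

    // returns false when saturated, so the caller can close the
    // connection and let the client retry against another DN
    boolean tryHandle(Runnable request) {
        try {
            pool.execute(request);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    void shutdown() { pool.shutdown(); }
}
```

Once 4 requests are running and 16 are queued, every further request is rejected immediately rather than held as a heavy object, which keeps the DN's heap bounded under a client onslaught.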
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)