[
https://issues.apache.org/jira/browse/HADOOP-18324?focusedWorklogId=790692&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-790692
]
ASF GitHub Bot logged work on HADOOP-18324:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 14/Jul/22 03:55
Start Date: 14/Jul/22 03:55
Worklog Time Spent: 10m
Work Description: ZanderXu commented on PR #4527:
URL: https://github.com/apache/hadoop/pull/4527#issuecomment-1183957096
@omalley We encountered an incident in our prod environment that is related
to this connection handling. Limiting the `rpcRequestQueue` would fix the
problem, and I would welcome your ideas.
The root cause is that the NameNode ran out of memory (OOM) because too many
requests were pending to be sent on a single connection.
- The network between the Observer NameNode and JournalNode 1 (JN1) is
abnormal, for example laggy or dropping TCP packets.
- The connection is not closed, but the NameNode can neither send requests
nor receive responses on it.
- The Observer NameNode keeps sending `getJournaledEdits` RPCs to JN1. It can
ignore JN1's responses because it has already received a quorum of
responses.
- The Observer NameNode tries to abandon the outstanding call by interrupting
it, but the interrupt has no effect on this connection.
- In the end, the NameNode hits OOM because too many pending requests have
accumulated on this abnormal connection.
So I think we could bound the size of `rpcRequestQueue` and throw an
IOException when it is full.
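A minimal sketch of that idea, assuming the send queue can be modelled as a
fixed-capacity `BlockingQueue`. The class `BoundedRpcRequestQueue` and its
methods are hypothetical names used for illustration only; this is not the
actual Hadoop `Client` code or the change in the PR.

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for a per-connection send queue (not Hadoop's class). */
class BoundedRpcRequestQueue<T> {
    private final BlockingQueue<T> queue;
    private final int capacity;

    BoundedRpcRequestQueue(int capacity) {
        this.capacity = capacity;
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Enqueue without blocking; fail fast when the connection is backed up. */
    void enqueue(T request) throws IOException {
        // offer() returns false immediately when the queue is at capacity,
        // so a stuck connection cannot accumulate requests without limit.
        if (!queue.offer(request)) {
            throw new IOException(
                "rpcRequestQueue is full (capacity=" + capacity + ")");
        }
    }

    /** Called by the sender side; returns null when nothing is pending. */
    T poll() {
        return queue.poll();
    }

    /** Number of requests currently waiting to be sent. */
    int pending() {
        return queue.size();
    }
}
```

With a bound like this, a caller that keeps issuing `getJournaledEdits` to a
stalled JournalNode gets an IOException once the queue fills, so the pending
requests no longer grow until the heap is exhausted. The right capacity would
need to be chosen (or made configurable) for the real workload.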
Issue Time Tracking
-------------------
Worklog Id: (was: 790692)
Time Spent: 2.5h (was: 2h 20m)
> Interrupting RPC Client calls can lead to thread exhaustion
> -----------------------------------------------------------
>
> Key: HADOOP-18324
> URL: https://issues.apache.org/jira/browse/HADOOP-18324
> Project: Hadoop Common
> Issue Type: Bug
> Components: ipc
> Affects Versions: 3.4.0, 2.10.2, 3.3.3
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Currently the IPC client creates a boundless number of threads to write the
> rpc request to the socket. The NameNode uses timeouts on its RPC calls to the
> Journal Node and a stuck JN will cause the NN to create an infinite set of
> threads.