[ 
https://issues.apache.org/jira/browse/HADOOP-18324?focusedWorklogId=790692&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-790692
 ]

ASF GitHub Bot logged work on HADOOP-18324:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/22 03:55
            Start Date: 14/Jul/22 03:55
    Worklog Time Spent: 10m 
      Work Description: ZanderXu commented on PR #4527:
URL: https://github.com/apache/hadoop/pull/4527#issuecomment-1183957096

   @omalley We encountered an incident in our prod environment that is related 
to this connection handling. Limiting the `rpcRequestQueue` can fix the problem, 
and I would welcome your ideas.
   
   The root cause is a NameNode OOM caused by too many pending outgoing 
requests on a single connection. 
   
   - The network between the Observer NameNode and JournalNode 1 (JN1) is 
abnormal, e.g. lag or TCP drops. 
   - The connection is not broken, but the NameNode cannot send requests or 
receive responses over it.
   - The Observer NameNode keeps sending `getJournaledEdits` RPCs to JN1, and 
it can ignore JN1's responses because it has already received a quorum of 
responses.
   - The Observer NameNode tries to ignore the abnormal response by 
interrupting the call, but it is not able to interrupt this connection.
   - In the end, the NameNode runs out of memory because too many pending 
requests accumulate on this abnormal connection.
   
   So I think we could limit the size of `rpcRequestQueue` and throw an 
IOException when it is full.
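
   A minimal sketch of that idea, not the actual `org.apache.hadoop.ipc.Client` 
code: the class name `BoundedRpcRequestQueue`, the `byte[]` element type, and 
the capacity parameter are hypothetical and only illustrate rejecting a request 
with an IOException instead of letting pending requests pile up.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Hypothetical illustration of a bounded request queue.
 * This is an assumption-based sketch, not Hadoop's real IPC client.
 */
final class BoundedRpcRequestQueue {
  // Serialized requests waiting to be written to the socket.
  private final BlockingQueue<byte[]> rpcRequestQueue;
  private final int capacity;

  BoundedRpcRequestQueue(int capacity) {
    this.capacity = capacity;
    this.rpcRequestQueue = new LinkedBlockingQueue<>(capacity);
  }

  /** Enqueue without blocking; fail fast when the queue is full. */
  void enqueue(byte[] request) throws IOException {
    // offer() returns false immediately if the capacity is reached.
    if (!rpcRequestQueue.offer(request)) {
      throw new IOException(
          "rpcRequestQueue is full (capacity " + capacity + ")");
    }
  }

  /** Called by the sender thread; blocks until a request is available. */
  byte[] take() throws InterruptedException {
    return rpcRequestQueue.take();
  }

  /** Number of requests currently waiting to be sent. */
  int pending() {
    return rpcRequestQueue.size();
  }
}
```

   With a bound like this, a caller talking to a stuck JN would see an 
IOException once the limit is hit, so memory use on that connection stays 
capped. The right capacity and whether to fail or block are open questions.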
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 790692)
    Time Spent: 2.5h  (was: 2h 20m)

> Interrupting RPC Client calls can lead to thread exhaustion
> -----------------------------------------------------------
>
>                 Key: HADOOP-18324
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18324
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 3.4.0, 2.10.2, 3.3.3
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently the IPC client creates a boundless number of threads to write the 
> rpc request to the socket. The NameNode uses timeouts on its RPC calls to the 
> Journal Node and a stuck JN will cause the NN to create an infinite set of 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
