[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

Jinglun (JIRA) Mon, 01 Jul 2019 02:14:20 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jinglun updated HADOOP-16403:
-----------------------------
    Attachment: HADOOP-16403.001.patch
        Status: Patch Available  (was: Open)

Patch 001 sketches my approach to the statistical queue and to making the 
reader queue swappable at runtime. I moved the swap logic from 
CallQueueManager into a new class, SwapQueueManager, and made CallQueueManager 
a subclass of it, so the reader queue can be swapped at runtime as well. I 
also added a new class, MetricLinkedBlockingQueue, which computes QPS and 
logs when the queue is full.
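
To illustrate the general idea of a queue that can be replaced while producers and consumers keep running, here is a minimal sketch. It is not the code from the attached patch: the class name SwappableQueue and its methods are hypothetical, and it assumes the new queue has enough free capacity to hold whatever is left in the old one.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; NOT the SwapQueueManager from the attached patch.
final class SwappableQueue<E> {
    // Producers and consumers look up the backing queue through a reference,
    // so the queue can be replaced without stopping them.
    private final AtomicReference<BlockingQueue<E>> putRef;
    private final AtomicReference<BlockingQueue<E>> takeRef;

    SwappableQueue(BlockingQueue<E> initial) {
        putRef = new AtomicReference<>(initial);
        takeRef = new AtomicReference<>(initial);
    }

    // Non-blocking enqueue into whichever queue is current for producers.
    boolean offer(E e) {
        return putRef.get().offer(e);
    }

    // Non-blocking dequeue from whichever queue is current for consumers.
    E poll() {
        return takeRef.get().poll();
    }

    int size() {
        return takeRef.get().size();
    }

    // Replace the backing queue. Assumption: newQueue can hold the leftovers;
    // drainTo throws IllegalStateException if a bounded newQueue fills up.
    synchronized void swap(BlockingQueue<E> newQueue) {
        // 1. Redirect producers to the new queue.
        BlockingQueue<E> oldQueue = putRef.getAndSet(newQueue);
        if (oldQueue == newQueue) {
            return; // nothing to do, and drainTo(self) would throw
        }
        // 2. Move elements still waiting in the old queue.
        oldQueue.drainTo(newQueue);
        // 3. Redirect consumers to the new queue.
        takeRef.set(newQueue);
    }
}
```

This sketch has known gaps: a producer that read the old reference just before the swap can still enqueue into the old queue after it was drained, and elements offered during the swap may end up ahead of the drained ones. A production version would need to re-check the old queue before discarding it.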

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16403
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Jinglun
>            Priority: Major
>         Attachments: HADOOP-16403.001.patch
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite 
> large, so after the active NameNode dies, it takes the standby more than 40s 
> to become active. Many requests (TCP connect requests and RPC requests) from 
> DataNodes, clients, and ZKFC time out and start retrying. The sudden flood 
> of requests lasts for the next 2 minutes, until every request has either 
> been handled or has run out of retries. 
> Adjusting the RPC-related settings might give the NameNode more capacity and 
> solve this problem, but the key point is finding the bottleneck. The RPC 
> server can be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got a 
> ConnectException, caused by a TCP connect request that went unanswered for 
> 20s. I suspect the reader queue is full and blocks the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole pipeline, and I need to know which one is responsible. I think *a 
> queue that computes QPS, writes a log entry when it is full, and can be 
> replaced easily* will help. 
> I found the nice work in HADOOP-10302, which implements a runtime-swappable 
> queue. Using it for the Reader's queue makes the reader queue 
> runtime-swappable automatically. The QPS computation could be done by 
> implementing a subclass of LinkedBlockingQueue that updates its counters 
> whenever put/take/... happens. The QPS data will be exposed via JMX.
>  
>  
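
The counting subclass described in the issue could look roughly like the sketch below. It is only an illustration under stated assumptions, not the MetricLinkedBlockingQueue from the patch: the class name CountingLinkedBlockingQueue, its accessors, and the perSecond helper are hypothetical, JMX registration is left out, and only offer/put/poll/take are instrumented.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;
import java.util.logging.Logger;

// Hypothetical illustration; NOT the MetricLinkedBlockingQueue from the patch.
final class CountingLinkedBlockingQueue<E> extends LinkedBlockingQueue<E> {
    private static final Logger LOG =
        Logger.getLogger(CountingLinkedBlockingQueue.class.getName());

    private final LongAdder enqueued = new LongAdder();
    private final LongAdder dequeued = new LongAdder();
    private final LongAdder rejected = new LongAdder();

    CountingLinkedBlockingQueue(int capacity) {
        super(capacity);
    }

    @Override
    public boolean offer(E e) {
        boolean accepted = super.offer(e);
        if (accepted) {
            enqueued.increment();
        } else {
            // The queue was full: count it and leave a trace in the log.
            rejected.increment();
            LOG.warning("queue full, size=" + size());
        }
        return accepted;
    }

    @Override
    public void put(E e) throws InterruptedException {
        super.put(e);
        enqueued.increment();
    }

    @Override
    public E poll() {
        E e = super.poll();
        if (e != null) {
            dequeued.increment();
        }
        return e;
    }

    @Override
    public E take() throws InterruptedException {
        E e = super.take();
        dequeued.increment();
        return e;
    }

    // Not instrumented in this sketch: timed offer/poll and drainTo.

    long enqueuedCount() { return enqueued.sum(); }
    long dequeuedCount() { return dequeued.sum(); }
    long rejectedCount() { return rejected.sum(); }

    // Average rate for a count observed over an elapsed interval.
    static double perSecond(long count, long elapsedNanos) {
        return elapsedNanos <= 0 ? 0.0 : count * 1e9 / elapsedNanos;
    }
}
```

A metrics thread could sample enqueuedCount() and dequeuedCount() at a fixed interval and pass the difference between two samples to perSecond. Comparing the enqueue and dequeue rates of the reader queue and the call queue would then indicate whether readers or handlers are falling behind.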



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

