[
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891407#comment-16891407
]
Wei-Chiu Chuang commented on HADOOP-16403:
------------------------------------------
Thanks for working on this, [~LiJinglun].
Is this planned as a troubleshooting tool that can be removed at runtime?
Or is it meant for benchmarking (i.e. not in a production environment)?
For the test result doc:
{quote}MetricLinkedBlockingQueue(without log)
{quote}
What does "without log" mean? Do you mean running the same test, but with the
log messages added in the patch removed (or with the log level set to WARN)?
The patch adds a new dfsadmin option {{-refreshReaderQueue}}. Please update the
doc to introduce this option.
> Start a new statistical rpc queue and make the Reader's pendingConnection
> queue runtime-replaceable
> ---------------------------------------------------------------------------------------------------
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf,
> HADOOP-16403.001.patch, HADOOP-16403.002.patch,
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite large,
> so after the active NameNode dies, it takes the standby more than 40s to
> become active. Many requests (TCP connect requests and RPC requests) from
> DataNodes, clients and ZKFC time out and start retrying. The sudden flood of
> requests lasts for the next 2 minutes, until all requests are either handled
> or have run out of retries.
> Adjusting the RPC-related settings might help the NameNode cope and solve
> this problem; the key point is finding the bottleneck. The RPC server can be
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got a
> ConnectTimeoutException, caused by a TCP connect request that went
> unanswered for 20s. I think maybe the reader queue is full and blocks the
> listener from handling new connections. Both slow handlers and slow readers
> can block the whole processing pipeline, and I need to know which one it is.
> I think *a queue that computes the QPS, writes a log when the queue is full
> and can be replaced easily* will help.
> I found the nice work in HADOOP-10302, which implements a runtime-swappable
> queue. Using it for the Reader's queue makes the reader queue
> runtime-swappable automatically. The QPS computation could be done by
> implementing a subclass of LinkedBlockingQueue that does the counting
> whenever put/take/... happens. The QPS data will be shown via JMX.
>
>
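To make the idea in the description concrete, here is a minimal sketch of a
LinkedBlockingQueue subclass that counts enqueues, dequeues and "queue full"
events. It is not the MetricLinkedBlockingQueue from the attached patches: the
class name CountingBlockingQueue, its accessors and the ratePerSecond helper
are hypothetical, and the timed offer/poll variants, drainTo and remove are
left uncounted for brevity.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch (not the class from the HADOOP-16403 patch): a bounded
 * queue that counts successful enqueues, dequeues and "queue full" events.
 */
class CountingBlockingQueue<E> extends LinkedBlockingQueue<E> {
    private final LongAdder puts = new LongAdder();
    private final LongAdder takes = new LongAdder();
    private final LongAdder fullEvents = new LongAdder();

    CountingBlockingQueue(int capacity) {
        super(capacity);
    }

    @Override
    public void put(E e) throws InterruptedException {
        // Try the non-blocking path first so that a full queue is recorded
        // before the producer (e.g. the listener) blocks.
        if (!super.offer(e)) {
            fullEvents.increment();
            super.put(e);
        }
        puts.increment();
    }

    @Override
    public boolean offer(E e) {
        boolean accepted = super.offer(e);
        if (accepted) {
            puts.increment();
        } else {
            fullEvents.increment();
        }
        return accepted;
    }

    @Override
    public E take() throws InterruptedException {
        E e = super.take();
        takes.increment();
        return e;
    }

    @Override
    public E poll() {
        E e = super.poll();
        if (e != null) {
            takes.increment();
        }
        return e;
    }

    long putCount() { return puts.sum(); }

    long takeCount() { return takes.sum(); }

    long fullCount() { return fullEvents.sum(); }

    /** Converts a counter delta over an elapsed interval into events per second. */
    static double ratePerSecond(long countDelta, long elapsedNanos) {
        return countDelta * 1_000_000_000.0 / elapsedNanos;
    }
}
```

A periodic task could sample putCount() and takeCount(), pass the deltas to
ratePerSecond, and publish the result (for example via JMX, as the description
proposes). A rising fullCount() on the reader queues would point at slow
readers rather than slow handlers.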