[ 
https://issues.apache.org/jira/browse/HDFS-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491251#comment-15491251
 ] 

Erik Krogen commented on HDFS-10475:
------------------------------------

Thanks for the pointer to RpcDetailedActivity, [~andrew.wang]. Definitely very 
helpful! 

I also looked at HdrHistogram. The data it could provide seems very 
informative, but I wonder how we would publish the histograms. Such information 
does not seem to fit into the existing metrics publication framework. If you 
have ideas about this, let me know.

The reason we are interested in pursuing this at the lock level rather than the 
RPC level is that RPC time includes, e.g., the time an operation spent waiting 
in the lock queue. If an operation has a long RPC time, it is not clear whether 
that is due to getting blocked behind other long operations or to slowness 
within the operation itself. It would be useful to be able to drill down and 
find the specific culprit operations that spend a lot of time holding the 
lock. 
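
To make the distinction concrete, here is a minimal sketch of hold-time 
accounting per operation. It is not the FSNamesystem code: the class 
{{LockHoldMetrics}}, its method names, and the idea of passing an operation 
name at unlock time are all assumptions for illustration, and only the write 
lock is covered. The timer starts after the lock is acquired, so time spent in 
the lock queue is excluded.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not the actual namenode implementation.
public class LockHoldMetrics {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
  private final Map<String, LongAdder> holdNanos = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> holdCount = new ConcurrentHashMap<>();
  // Start of the current write-lock hold; only touched by the lock holder.
  private long writeLockedAtNanos;

  public void writeLock() {
    lock.writeLock().lock(); // any queue wait ends once this returns
    if (lock.getWriteHoldCount() == 1) {
      // Outermost acquisition: start timing the hold, not the wait.
      writeLockedAtNanos = System.nanoTime();
    }
  }

  // opName is an assumed label for the operation releasing the lock.
  public void writeUnlock(String opName) {
    boolean outermost = lock.getWriteHoldCount() == 1;
    long heldNanos = System.nanoTime() - writeLockedAtNanos;
    lock.writeLock().unlock();
    if (outermost) {
      // Record outside the lock so bookkeeping does not extend the hold.
      holdNanos.computeIfAbsent(opName, k -> new LongAdder()).add(heldNanos);
      holdCount.computeIfAbsent(opName, k -> new LongAdder()).increment();
    }
  }

  public long totalHoldNanos(String opName) {
    LongAdder a = holdNanos.get(opName);
    return a == null ? 0L : a.sum();
  }

  public long holdCount(String opName) {
    LongAdder a = holdCount.get(opName);
    return a == null ? 0L : a.sum();
  }
}
```

With counters like these, an operation that is slow only because it queued 
behind others would show a small hold time, while the real culprit would show 
a large one.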

[~kihwal], for your first two examples, even if the frequency was low this 
would still show as a spike in the metrics, right? Combined with the long-held 
lock logging from HDFS-10817 and HDFS-9145 it seems these cases should be 
pretty well covered. Doing lock-level metrics would enable us to capture the 
last examples you discussed which cannot be captured by the current RPC-level 
metrics. 

The question about {{getContentSummary}} is interesting, but this is a special 
case, right? When looking at the metrics for {{getContentSummary}}, it would be 
prudent to keep in mind that the "number of ops" may be an overestimate, since 
each lock period would be counted as an op. The overall time spent holding the 
lock for {{getContentSummary}} would still be recorded accurately, which helps 
give an idea of which operations are expensive in terms of locking. 
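
As a small illustration of that caveat, the sketch below records one metric 
sample per lock release. The class name {{YieldingOpCount}}, its methods, and 
the three hold durations are made up for the example; the point is only that a 
single call which releases and reacquires the lock three times shows up as 
three "ops", while the summed hold time stays correct.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: one sample is recorded per lock release.
public class YieldingOpCount {
  private final LongAdder lockPeriods = new LongAdder();
  private final LongAdder heldMillis = new LongAdder();

  // Called each time the operation releases the lock.
  public void recordHold(long millis) {
    lockPeriods.increment();
    heldMillis.add(millis);
  }

  public long ops() { return lockPeriods.sum(); }

  public long totalHeldMillis() { return heldMillis.sum(); }

  public static void main(String[] args) {
    YieldingOpCount m = new YieldingOpCount();
    // One logical call that yields the lock twice: three lock periods.
    long[] holds = {40, 35, 25}; // made-up durations in milliseconds
    for (long h : holds) {
      m.recordHold(h);
    }
    // The op count is inflated to 3, but the total hold time is exact.
    System.out.println("ops=" + m.ops() + " totalMs=" + m.totalHeldMillis());
  }
}
```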

> Adding metrics for long FSNamesystem read and write locks
> ---------------------------------------------------------
>
>                 Key: HDFS-10475
>                 URL: https://issues.apache.org/jira/browse/HDFS-10475
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Xiaoyu Yao
>            Assignee: Erik Krogen
>
> This is a follow up of the comment on HADOOP-12916 and 
> [here|https://issues.apache.org/jira/browse/HDFS-9924?focusedCommentId=15310837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15310837]
>  add more metrics and WARN/DEBUG logs for long FSD/FSN locking operations on 
> namenode similar to what we have for slow write/network WARN/metrics on 
> datanode.


