[ 
https://issues.apache.org/jira/browse/HDFS-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772449#comment-16772449
 ] 

Erik Krogen commented on HDFS-14297:
------------------------------------

Hi [~Tao Jie], do the directories you are monitoring have quotas set on them? 
If so, you can use the {{getQuotaUsage()}} API, which is a constant-time 
operation.

> Add cache for getContentSummary() result
> ----------------------------------------
>
>                 Key: HDFS-14297
>                 URL: https://issues.apache.org/jira/browse/HDFS-14297
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Tao Jie
>            Priority: Major
>
> In a large HDFS cluster, calling {{getContentSummary}} for a directory with 
> large amount of files is very expensive. In a certain cluster with more than 
> 100 million files, calling {{getContentSummary}} may take more than 10s and 
> it will hold fsnamesystem lock for such a long time.
> In our cluster, there are several peripheral systems calling 
> {{getContentSummary}} periodically to monitor the status of dirs. Actually we 
> don't need the very accurate result in most cases. We could keep a cache for 
> those contentSummary result in namenode, with which we could avoid repeated 
> heavy request in a span. Also we should add more restrictions to  this cache: 
> 1,its size should be limited and it should be LRU, 2, only result of heavy 
> request would be  added to this cache, eg, rpctime over 1000ms.
> We may create a new RPC method or add a flag to the current method so that we 
> will not modify the current behavior and we can have a choose of a accurate 
> but expensive method or a fast but inaccurate method. 
> Any thought?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to