[ 
https://issues.apache.org/jira/browse/HDFS-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490032#comment-16490032
 ] 

Andrew Wang commented on HDFS-13616:
------------------------------------

Hi Zhe, thanks for taking a look! This API respects the existing lsLimit 
setting of 100, and also limits the number of paths that can be listed in a 
single batch call. This means that the per-call overhead is very similar to the 
existing RemoteIterator<FileStatus> calls when returning 100-item partial 
listings. Todd saw ~7ms RPC handling times for 100-item batches on a cluster, 
which feels like the right granularity for holding a read lock.

To answer Todd's question about benchmarking, I wrote a little unit test that 
invokes NameNodeRpcServer directly and times with System.nanoTime(). I made a 
synthetic directory structure with 30,000 directories, each with one file. This 
makes it a best-case scenario for the batched listing API. To allow for JVM 
warmup, I let the benchmarks run for about 30s before recording with JFR/JMC.
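
For illustration, here is a minimal sketch of the warmup-then-measure timing 
pattern described above. It is not the actual unit test: the class name, the 
itemsPerSecond helper, and the stand-in "call" are all hypothetical, and the 
real test invoked NameNodeRpcServer rather than a lambda.

```java
import java.util.function.IntSupplier;

public class WarmupBenchmark {

    /**
     * Runs the call repeatedly for warmupNanos without recording anything,
     * then for at least measureNanos while counting returned items.
     * The call returns how many items (e.g. statuses) it produced.
     */
    static double itemsPerSecond(IntSupplier call, long warmupNanos, long measureNanos) {
        if (measureNanos <= 0) {
            throw new IllegalArgumentException("measureNanos must be positive");
        }
        // Warmup phase: give the JIT time to compile the hot path.
        long warmupStart = System.nanoTime();
        while (System.nanoTime() - warmupStart < warmupNanos) {
            call.getAsInt();
        }
        // Measurement phase: only these iterations are counted.
        long items = 0;
        long start = System.nanoTime();
        long elapsed;
        do {
            items += call.getAsInt();
            elapsed = System.nanoTime() - start;
        } while (elapsed < measureNanos);
        return items * 1e9 / elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for one listing call returning 100 entries.
        IntSupplier fakeListing = () -> 100;
        double rate = itemsPerSecond(fakeListing, 50_000_000L, 200_000_000L);
        System.out.printf("%.0f items/s%n", rate);
    }
}
```

Elapsed time is computed as a difference of two System.nanoTime() readings, 
since the absolute values are only meaningful relative to each other.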

I was able to list 8.4x more LocatedFileStatuses/second with the batched 
listing. JMC showed a TLAB allocation rate of 5x. Non-TLAB allocation was 
trivial. This means we're much more CPU efficient per-FileStatus, and also 
doing less allocation.

Since this did not include RTT or lock contention from concurrent threads, 
a more realistic benchmark might do even better. I think this explains the 
10-20x that Todd saw when benchmarking on a real cluster.
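
As a rough sketch of why adding RTT could push the gain past what the unit 
test showed, here is a toy cost model. It is an assumption, not something 
measured: it supposes the unbatched client pays one round trip per directory, 
the batched client pays one round trip per batch, and lock contention is 
ignored. The class and parameter names are hypothetical, and the inputs in 
main are placeholders rather than cluster measurements.

```java
public class BatchSpeedupModel {

    /**
     * Estimated speedup of batched over unbatched listing.
     *
     * rttMs          network round-trip time per RPC
     * serverMsPerDir server-side cost per directory when unbatched
     * dirsPerBatch   directories covered by one batched RPC
     * serverGain     server-side per-directory improvement from batching
     */
    static double speedup(double rttMs, double serverMsPerDir,
                          int dirsPerBatch, double serverGain) {
        // Unbatched: every directory pays a full round trip plus server time.
        double unbatchedPerDir = rttMs + serverMsPerDir;
        // Batched: the round trip is shared by the whole batch.
        double batchedPerDir = rttMs / dirsPerBatch + serverMsPerDir / serverGain;
        return unbatchedPerDir / batchedPerDir;
    }

    public static void main(String[] args) {
        // Placeholder inputs; 8.4 is the server-side gain from the unit test.
        System.out.println(speedup(0.0, 1.0, 100, 8.4));
        System.out.println(speedup(1.0, 1.0, 100, 8.4));
    }
}
```

With a zero RTT the model reduces to the server-side gain alone. Any 
positive RTT raises the estimate as long as the batch size exceeds that gain.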

> Batch listing of multiple directories
> -------------------------------------
>
>                 Key: HDFS-13616
>                 URL: https://issues.apache.org/jira/browse/HDFS-13616
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 3.2.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>            Priority: Major
>         Attachments: HDFS-13616.001.patch
>
>
> One of the dominant workloads for external metadata services is listing of 
> partition directories. This can end up being bottlenecked on RTT time when 
> partition directories contain a small number of files. This is fairly common, 
> since fine-grained partitioning is used for partition pruning by the query 
> engines.
> A batched listing API that takes multiple paths amortizes the RTT cost. 
> Initial benchmarks show a 10-20x improvement in metadata loading performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

