[ 
https://issues.apache.org/jira/browse/HDFS-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490068#comment-16490068
 ] 

Andrew Wang commented on HDFS-13616:
------------------------------------

Latest patch addresses some precommit issues. As stated earlier, non-HDFS 
filesystems are going to throw UnsupportedOperationException. One correction to 
my earlier comment: the default listing limit is 1000, not 100. The 100 figure 
is the current default limit on the number of paths that can be listed per 
batched listing call.
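
To make the per-call path limit concrete, here is a minimal Java sketch of how a client might split a long list of partition directories into batched calls. It assumes the 100-path per-call default mentioned above. The class and method names (`BatchPlanner`, `partition`) and the example paths are hypothetical illustrations, not part of the patch or the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the HDFS-13616 patch.
class BatchPlanner {
    // Assumption taken from the comment above: at most 100 paths per batched call.
    static final int MAX_PATHS_PER_CALL = 100;

    // Splits the input into consecutive chunks of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> paths, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += batchSize) {
            int end = Math.min(i + batchSize, paths.size());
            batches.add(new ArrayList<>(paths.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            dirs.add("/warehouse/t/part=" + i); // made-up partition paths
        }
        List<List<String>> batches = partition(dirs, MAX_PATHS_PER_CALL);
        // prints "3 calls instead of 250"
        System.out.println(batches.size() + " calls instead of " + dirs.size());
    }
}
```

With 250 sparse partition directories, the client would issue 3 round trips rather than 250, which is where the amortization comes from.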

Hi Nicholas, thanks for taking a look. Currently we don't see a need for API 
support beyond listing. The workload we're looking at is metadata loading for 
applications like Hive and Impala.

Regarding an async API, Todd's benchmarking shows that the batched API is more 
CPU efficient than processing individual listing calls. It beats the 5-thread 
case for sparse directories in CPU time and wall time. My benchmarking 
additionally shows that the batched API generates significantly less garbage.

This batched listing API could also be combined with an async API (or a thread 
pool), so it's not an "either or" situation.
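
As a rough illustration of combining the two approaches, the Java sketch below submits each batch to a thread pool and merges the results. The `listBatch` function is a hypothetical stand-in for whatever batched listing call the client uses; it is not the API from the patch, and the batch size and thread count are parameters the caller would pick.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: batched listing calls issued in parallel from a thread pool.
class ParallelBatchLister {
    // listBatch maps a batch of directories to {directory -> entries}; it is an
    // assumed stand-in for one batched listing RPC.
    static Map<String, List<String>> listAll(
            List<String> dirs, int batchSize, int threads,
            Function<List<String>, Map<String, List<String>>> listBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, List<String>>>> futures = new ArrayList<>();
            for (int i = 0; i < dirs.size(); i += batchSize) {
                int end = Math.min(i + batchSize, dirs.size());
                List<String> batch = new ArrayList<>(dirs.subList(i, end));
                Callable<Map<String, List<String>>> task = () -> listBatch.apply(batch);
                futures.add(pool.submit(task));
            }
            // Merge in submission order so the result order follows the input order.
            Map<String, List<String>> merged = new LinkedHashMap<>();
            for (Future<Map<String, List<String>>> f : futures) {
                merged.putAll(f.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while listing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batched listing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task still covers many directories per round trip, so the pool only adds concurrency on top of the batching rather than replacing it.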

> Batch listing of multiple directories
> -------------------------------------
>
>                 Key: HDFS-13616
>                 URL: https://issues.apache.org/jira/browse/HDFS-13616
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 3.2.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>            Priority: Major
>         Attachments: HDFS-13616.001.patch, HDFS-13616.002.patch
>
>
> One of the dominant workloads for external metadata services is listing of 
> partition directories. This can end up bottlenecked on round-trip time (RTT) when 
> partition directories contain a small number of files. This is fairly common, 
> since fine-grained partitioning is used for partition pruning by the query 
> engines.
> A batched listing API that takes multiple paths amortizes the RTT cost. 
> Initial benchmarks show a 10-20x improvement in metadata loading performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

