[
https://issues.apache.org/jira/browse/HDFS-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791908#comment-16791908
]
Todd Lipcon commented on HDFS-13616:
------------------------------------
Another data point here: this would also be very useful for the Hive ACID table
layout. Currently, querying a Hive ACID table requires Hive to do one listStatus
for each partition and then, within each partition, one listStatus per
uncompacted transaction range (minimum 1 for a fully compacted table). So for a
fully compacted table with relatively fine-grained partitions, the ratio of
returned files to listStatus calls can be quite small. If we assume that a large
portion of the load on a NN comes from Hive workloads, implementing RPC
batching could reduce the RPC rate by an order of magnitude or more.
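
To make the arithmetic above concrete, here is a rough back-of-the-envelope
sketch in Java. It only counts calls and does not touch HDFS. The class name,
method names, and the per-call path limit are all hypothetical, chosen for
illustration. The model assumes every partition has the same number of
transaction-range directories and that no single listing is large enough to be
paginated.

```java
public class ListingRpcEstimate {

    /**
     * Unbatched: one listStatus per partition, plus one listStatus per
     * transaction-range directory inside each partition.
     */
    static long unbatchedCalls(long partitions, long rangesPerPartition) {
        return partitions + partitions * rangesPerPartition;
    }

    /**
     * Batched: the same directories are listed, but up to maxPathsPerCall
     * paths share one RPC. maxPathsPerCall is an assumed limit, not a value
     * taken from the patch. Two rounds are still needed, because the range
     * directories are only discovered by listing the partitions first.
     */
    static long batchedCalls(long partitions, long rangesPerPartition, long maxPathsPerCall) {
        long partitionRound = ceilDiv(partitions, maxPathsPerCall);
        long rangeRound = ceilDiv(partitions * rangesPerPartition, maxPathsPerCall);
        return partitionRound + rangeRound;
    }

    /** Integer division rounded up, for non-negative a and positive b. */
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        long partitions = 1000;        // illustrative table size
        long rangesPerPartition = 1;   // fully compacted table
        long maxPathsPerCall = 100;    // hypothetical batch limit

        System.out.println("unbatched calls: "
                + unbatchedCalls(partitions, rangesPerPartition));
        System.out.println("batched calls:   "
                + batchedCalls(partitions, rangesPerPartition, maxPathsPerCall));
    }
}
```

With these illustrative inputs the model gives 2000 unbatched calls versus 20
batched ones. The real saving depends on the batch limit the NN enforces and on
how many listings need more than one response.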
> Batch listing of multiple directories
> -------------------------------------
>
> Key: HDFS-13616
> URL: https://issues.apache.org/jira/browse/HDFS-13616
> Project: Hadoop HDFS
> Issue Type: New Feature
> Affects Versions: 3.2.0
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Priority: Major
> Attachments: BenchmarkListFiles.java, HDFS-13616.001.patch,
> HDFS-13616.002.patch
>
>
> One of the dominant workloads for external metadata services is listing of
> partition directories. This can end up being bottlenecked on RTT when
> partition directories contain a small number of files. This is fairly common,
> since fine-grained partitioning is used for partition pruning by the query
> engines.
> A batched listing API that takes multiple paths amortizes the RTT cost.
> Initial benchmarks show a 10-20x improvement in metadata loading performance.