[ 
https://issues.apache.org/jira/browse/HDFS-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490305#comment-16490305
 ] 

Aaron Fabbri commented on HDFS-13616:
-------------------------------------

Cool stuff [~andrew.wang]. I like batching. :)

I haven't had time for a full review yet, but here are a couple of quick/stupid questions.
{noformat}
+   * Batched listing API that returns {@link PartialListing}s for the
+   * passed Paths.
+   *
+   * @param paths List of paths to list.
+   * @return RemoteIterator that returns corresponding PartialListings.
+   * @throws IOException
+   */
+  public RemoteIterator<PartialListing<FileStatus>> batchedListStatusIterator(
+      final List<Path> paths) throws IOException {

{noformat}
Are paths listed recursively or not? We might as well specify that here.

Why not just RemoteIterator<FileStatus>?
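
To make the question concrete, here is a rough sketch of what a flat iterator would give callers. It uses hypothetical stand-in types ({{Listing}}, {{flatten}}) and plain {{java.util.Iterator}} rather than the patch's {{PartialListing}} and Hadoop's {{RemoteIterator}} (which also throws IOException), so treat it as an illustration of the trade-off, not as the patch's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class FlattenSketch {
    /** Hypothetical stand-in for the patch's PartialListing; not the Hadoop class. */
    static final class Listing<T> {
        final String listedPath;  // the requested path these entries belong to
        final List<T> entries;    // children returned for that path (may be empty)

        Listing(String listedPath, List<T> entries) {
            this.listedPath = listedPath;
            this.entries = entries;
        }
    }

    /** Flattens per-path listings into one entry stream; listedPath is dropped. */
    static <T> Iterator<T> flatten(final Iterator<Listing<T>> listings) {
        return new Iterator<T>() {
            private Iterator<T> current = Collections.<T>emptyIterator();

            @Override
            public boolean hasNext() {
                // Skip over listings with no entries until something is available.
                while (!current.hasNext() && listings.hasNext()) {
                    current = listings.next().entries.iterator();
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    /** Flattens three made-up partition listings, one of them empty. */
    static List<String> demo() {
        List<Listing<String>> listings = new ArrayList<>();
        listings.add(new Listing<>("/t/p=1", Arrays.asList("a", "b")));
        listings.add(new Listing<>("/t/p=2", Collections.<String>emptyList()));
        listings.add(new Listing<>("/t/p=3", Arrays.asList("c")));

        List<String> flat = new ArrayList<>();
        Iterator<String> it = flatten(listings.iterator());
        while (it.hasNext()) {
            flat.add(it.next());
        }
        return flat;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the flat stream yields a, b, c with no marker for where one requested path ends and the next begins, and the empty /t/p=2 leaves no trace at all. If callers need either of those, that would be an argument for keeping the per-path wrapper.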
{noformat}
+ * partial listing, multiple ListingBatches may need to be combined to obtain
{noformat}
ListingBatches? Did you mean PartialListing?

Another thought: the test code looks DFS-specific. Do we want to test {{FileSystem}} 
instead and make this a filesystem contract test?

For the benchmarking, what was the comparison code? A recursive 
listLocatedStatus() loop? I'm curious what the delta would be against an 
optimized listFiles(recursive=true) on a parent dir instead. Does that even fit 
the use case? (I'm guessing no, since only some of the partition dirs in the 
parent need listing, but we need to justify any new FileSystem surface area.)
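
For framing the comparison, a back-of-envelope model of round trips may help. The sketch below is only an assumption-laden illustration: the class and method names are made up, it assumes every directory (or batch) fits in a single response, and the numbers in {{main}} are invented rather than taken from the benchmark.

```java
class RoundTripModel {
    /** One listing call per directory, assuming each directory fits in one response. */
    static long perDirectoryCalls(long dirs) {
        return dirs;
    }

    /** ceil(dirs / batchSize) calls, assuming each batch fits in one response. */
    static long batchedCalls(long dirs, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (dirs + batchSize - 1) / batchSize;
    }

    /** Time spent purely waiting on round trips, in milliseconds. */
    static double rttCostMillis(long calls, double rttMillis) {
        return calls * rttMillis;
    }

    public static void main(String[] args) {
        long dirs = 1000;        // hypothetical number of partition directories
        long batchSize = 100;    // hypothetical number of paths per batched call
        double rttMillis = 0.5;  // hypothetical round-trip time

        System.out.println("per-directory ms: "
            + rttCostMillis(perDirectoryCalls(dirs), rttMillis));
        System.out.println("batched ms: "
            + rttCostMillis(batchedCalls(dirs, batchSize), rttMillis));
    }
}
```

The model ignores server-side work and payload size, so it only bounds the RTT component. A recursive listing on the parent would also need a term for the partitions that get listed but are not wanted.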

 

> Batch listing of multiple directories
> -------------------------------------
>
>                 Key: HDFS-13616
>                 URL: https://issues.apache.org/jira/browse/HDFS-13616
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 3.2.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>            Priority: Major
>         Attachments: HDFS-13616.001.patch, HDFS-13616.002.patch
>
>
> One of the dominant workloads for external metadata services is listing of 
> partition directories. This can end up being bottlenecked on round-trip time (RTT) when 
> partition directories contain a small number of files. This is fairly common, 
> since fine-grained partitioning is used for partition pruning by the query 
> engines.
> A batched listing API that takes multiple paths amortizes the RTT cost. 
> Initial benchmarks show a 10-20x improvement in metadata loading performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
