pitrou commented on code in PR #13796:
URL: https://github.com/apache/arrow/pull/13796#discussion_r939502452


##########
cpp/src/arrow/filesystem/localfs.cc:
##########
@@ -309,6 +313,202 @@ Result<std::vector<FileInfo>> 
LocalFileSystem::GetFileInfo(const FileSelector& s
   return results;
 }
 
+namespace {
+
+/// Workhorse for streaming async implementation of `GetFileInfo`
+/// (`GetFileInfoGenerator`).
+///
+/// There are two variants of async discovery functions suported:
+/// 1. `DiscoverPartitionFiles`, which parallelizes traversal of individual 
directories
+///    so that each directory results are yielded as a separate 
`FileInfoGenerator` via
+///    an underlying `DiscoveryImplIterator`, which delivers items in chunks 
(default size
+///    is `kBatchSize == 1K` items).

Review Comment:
   I don't think delivering items in fixed-size chunks is a worthwhile goal, on 
the contrary.
   The purpose of an async API such as `GetFileInfoGenerator` is to deliver 
results to the consumer as soon as they arrive. So when you get the contents of 
a 5-file directory, emit them ASAP as a single chunk. When you get the contents 
of a 999999-file directory, emit them ASAP as a single chunk.
   
   The consumer may want to normalize batch size if that makes sense for them, 
but it might not (for example if you simply print files on stdout, why would 
you enforce a batch size?).
   
   This will also simplify the producer implementation :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to