ManManson commented on code in PR #13796:
URL: https://github.com/apache/arrow/pull/13796#discussion_r939505349


##########
cpp/src/arrow/filesystem/localfs.cc:
##########
@@ -309,6 +313,202 @@ Result<std::vector<FileInfo>> 
LocalFileSystem::GetFileInfo(const FileSelector& s
   return results;
 }
 
+namespace {
+
+/// Workhorse for streaming async implementation of `GetFileInfo`
+/// (`GetFileInfoGenerator`).
+///
+/// There are two variants of async discovery functions suported:
+/// 1. `DiscoverPartitionFiles`, which parallelizes traversal of individual 
directories
+///    so that each directory results are yielded as a separate 
`FileInfoGenerator` via
+///    an underlying `DiscoveryImplIterator`, which delivers items in chunks 
(default size
+///    is `kBatchSize == 1K` items).

Review Comment:
   If we yield on every new item in the queue, more work will be just wasted on 
moving back and forth between IO and CPU threads, AFAIU, so internal chunking 
makes perfect sense. Moreover, it's naturally supported by the signature of 
`GetFileInfoGenerator`, which is supposed to deliver items in some chunks 
(though, without indication of batch sizes).
   
   If consumer wants a flattened view (e.g. `AsyncGenerator<FileInfo>`), it 
would want use more primitives from `async_generator.h` anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to