ManManson commented on code in PR #13796:
URL: https://github.com/apache/arrow/pull/13796#discussion_r939504882
##########
cpp/src/arrow/filesystem/localfs.cc:
##########
@@ -309,6 +313,202 @@ Result<std::vector<FileInfo>>
LocalFileSystem::GetFileInfo(const FileSelector& s
return results;
}
+namespace {
+
+/// Workhorse for streaming async implementation of `GetFileInfo`
+/// (`GetFileInfoGenerator`).
+///
+/// There are two variants of async discovery functions suported:
+/// 1. `DiscoverPartitionFiles`, which parallelizes traversal of individual
directories
+/// so that each directory results are yielded as a separate
`FileInfoGenerator` via
+/// an underlying `DiscoveryImplIterator`, which delivers items in chunks
(default size
+/// is `kBatchSize == 1K` items).
Review Comment:
> I don't think delivering items in fixed-size chunks is a worthwhile goal,
on the contrary.
The purpose of an async API such as GetFileInfoGenerator is to deliver
results to the consumer as soon as they arrive. So when you get the contents of
a 5-file directory, emit them ASAP as a single chunk. When you get the contents
of a 999999-file directory, emit them ASAP as a single chunk.
This is exactly what this impl does. Each directory iterator has its own
chunk tracking and flushes the chunks either as they are filled up (to
`kBatchSize` entries) or the iterator is exhausted (in this case
`current_chunk_.size()` would be less than `kBatchSize`). So I see there no
problem at all, although the chunk size should be a configurable option, too.
Also, I just noticed that I always reserve `kBatchSize` elements for a
chunk, regardless of how much elements are left unprocessed. In such case the
allocation size can optimized by checking against `child_fns_.size() - idx_`.
I'll fix that one.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]