westonpace commented on a change in pull request #10103:
URL: https://github.com/apache/arrow/pull/10103#discussion_r619488088
##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -242,5 +240,15 @@ Result<ScanTaskIterator> CsvFileFormat::ScanFile(
return MakeVectorIterator<std::shared_ptr<ScanTask>>({std::move(task)});
}
+Result<RecordBatchGenerator> CsvFileFormat::ScanBatchesAsync(
+ const std::shared_ptr<ScanOptions>& scan_options,
+ const std::shared_ptr<FileFragment>& file) const {
+ auto this_ = checked_pointer_cast<const CsvFileFormat>(shared_from_this());
+ auto source = file->source();
+ auto reader_fut =
+ OpenReaderAsync(source, *this, scan_options, internal::GetCpuThreadPool());
+ return GeneratorFromReader(std::move(reader_fut));
Review comment:
The CSV reader does not support parallel readahead yet. Serial
readahead might be redundant, since the background generator is already
doing much the same thing. I suppose it would give us a bit of pipeline
parallelism (allowing us to filter & project one batch while we parse &
decode the next), but at the cost of breaking up some of the processing
locality (e.g. right now we always parse X, decode X, filter X, project X
before moving on). I'll run some experiments and see if it has any
noticeable effect in either direction.
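To make the trade-off concrete, here is a rough, self-contained sketch of what serial readahead amounts to. It does not use Arrow's generator or CSV reader APIs: `Generator`, `SerialReadahead`, and `RunPipelineDemo` are hypothetical names, the integers stand in for record batches, and the lambda stands in for parse & decode. One background thread pulls items in order and stays at most `depth` items ahead, so the consumer can filter & project item N while item N+1 is being produced.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a batch generator (not Arrow's AsyncGenerator):
// returns the next item, or std::nullopt once the stream is exhausted.
template <typename T>
using Generator = std::function<std::optional<T>()>;

// Serial readahead: a single worker thread calls `source` in order and keeps
// up to `depth` items queued ahead of the consumer.
template <typename T>
class SerialReadahead {
 public:
  SerialReadahead(Generator<T> source, std::size_t depth)
      : source_(std::move(source)), depth_(depth) {
    assert(depth_ > 0);
    worker_ = std::thread([this] { Produce(); });
  }

  ~SerialReadahead() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      cancelled_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  SerialReadahead(const SerialReadahead&) = delete;
  SerialReadahead& operator=(const SerialReadahead&) = delete;

  // Blocks until the next item (or the end marker) is available.
  std::optional<T> Next() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    std::optional<T> item = std::move(queue_.front());
    // The end marker stays queued so later calls keep returning nullopt.
    if (item.has_value()) queue_.pop();
    lock.unlock();
    cv_.notify_all();  // wake the producer: a slot may have been freed
    return item;
  }

 private:
  void Produce() {
    for (;;) {
      // The "parse & decode" stage runs here, off the consumer thread.
      std::optional<T> item = source_();
      const bool done = !item.has_value();
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return cancelled_ || queue_.size() < depth_; });
      if (cancelled_) return;
      queue_.push(std::move(item));
      lock.unlock();
      cv_.notify_all();
      if (done) return;
    }
  }

  Generator<T> source_;
  std::size_t depth_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::optional<T>> queue_;
  bool cancelled_ = false;
  std::thread worker_;
};

// Toy pipeline: the source yields 0..4; the consumer keeps even values
// ("filter") and multiplies them by 10 ("project").
std::vector<int> RunPipelineDemo() {
  int next = 0;
  Generator<int> parse_and_decode = [next]() mutable -> std::optional<int> {
    if (next == 5) return std::nullopt;
    return next++;
  };
  SerialReadahead<int> readahead(std::move(parse_and_decode), /*depth=*/2);
  std::vector<int> projected;
  while (std::optional<int> batch = readahead.Next()) {
    if (*batch % 2 == 0) projected.push_back(*batch * 10);
  }
  return projected;
}
```

Item order is preserved, so `RunPipelineDemo()` returns `{0, 20, 40}` regardless of thread timing. The locality cost mentioned above shows up in where the work runs: each item is produced on the worker thread and consumed on another, rather than going through every stage on one thread.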
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]