lidavidm commented on a change in pull request #10008: URL: https://github.com/apache/arrow/pull/10008#discussion_r614827209
########## File path: cpp/src/arrow/dataset/scanner.h ########## @@ -318,6 +337,28 @@ class ARROW_DS_EXPORT SyncScanner : public Scanner { std::shared_ptr<Fragment> fragment_; }; +class ARROW_DS_EXPORT AsyncScanner : public Scanner, Review comment: Is this worth declaring in the header? We could keep it in the .cc file so that people don't instantiate it directly. ########## File path: cpp/src/arrow/dataset/scanner.cc ########## @@ -244,6 +244,8 @@ Status ScannerBuilder::UseThreads(bool use_threads) { return Status::OK(); } +void ScannerBuilder::UseAsync(bool use_async) { scan_options_->use_async = use_async; } Review comment: nit: all the other methods return Status (even if it's a bit pointless) ########## File path: cpp/src/arrow/dataset/scanner.h ########## @@ -166,13 +169,29 @@ class ARROW_DS_EXPORT ScanTask { std::shared_ptr<Fragment> fragment_; }; -template <typename T> -struct Enumerated { - T value; - int index; - bool last; +/// \brief A trivial ScanTask that yields the RecordBatch of an array. Review comment: FWIW I had moved these helpers to the bottom of the file (outside the `@}`) so that they don't get picked up in the API docs as I'd basically consider them implementation details. Though at that point we may actually want them to be in one of the `_internal` headers. ########## File path: cpp/src/arrow/dataset/file_base.cc ########## @@ -102,6 +102,58 @@ Result<std::shared_ptr<FileFragment>> FileFormat::MakeFragment( std::move(partition_expression), std::move(physical_schema))); } +// TODO(ARROW-12355[CSV], ARROW-11772[IPC], ARROW-11843[Parquet]) The following +// implementation of ScanBatchesAsync is both ugly and terribly ineffecient. Each of the +// formats should provide their own efficient implementation. +Result<RecordBatchGenerator> FileFormat::ScanBatchesAsync( + const ScanOptions& options, const std::shared_ptr<FileFragment>& file) { + std::shared_ptr<ScanOptions> scan_options = std::make_shared<ScanOptions>(options); Review comment: nit: if we're going to heap-allocate them anyways, and the sync version takes a shared_ptr, maybe we can just take a shared_ptr here in the first place -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org