lidavidm commented on a change in pull request #10008: URL: https://github.com/apache/arrow/pull/10008#discussion_r612469861
##########
File path: cpp/src/arrow/dataset/dataset.cc
##########
@@ -95,6 +95,33 @@ Result<ScanTaskIterator> InMemoryFragment::Scan(std::shared_ptr<ScanOptions> opt
   return MakeMapIterator(fn, std::move(batches_it));
 }
 
+Result<RecordBatchGenerator> InMemoryFragment::ScanBatchesAsync(
+    const ScanOptions& options) {
+  struct Generator {
+    Future<std::shared_ptr<RecordBatch>> operator()() {
+      if (batch_index >= self->record_batches_.size()) {
+        return AsyncGeneratorEnd<std::shared_ptr<RecordBatch>>();
+      }
+      const auto& next_parent = self->record_batches_[batch_index];
+      if (offset + batch_size < next_parent->num_rows()) {
+        offset += batch_size;
+        auto next = next_parent->Slice(offset, batch_size);
+        return Future<std::shared_ptr<RecordBatch>>::MakeFinished(std::move(next));
+      }
+      batch_index++;
+      auto next = next_parent->Slice(offset, batch_size);
+      return Future<std::shared_ptr<RecordBatch>>::MakeFinished(std::move(next));

Review comment:
A few things here:
- Shouldn't `offset` be reset when we advance to the next batch?
- The check for whether we've consumed the current batch should just be `offset < num_rows()`, I think.
- `next_parent->Slice` should come before we update the offset.
- It might be easier to just recurse after advancing to the next batch, if we care about avoiding empty batches. Else, we should update `offset` after the second `Slice` call too.
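
To make the four points above concrete, here is a rough standalone sketch of the control flow they imply. It is not Arrow code: `Rows` and `SliceGenerator` are hypothetical stand-ins, a `std::vector<int>` plays the role of a record batch, and `std::optional` stands in for the future/end-of-stream marker. It also uses a loop where the comment suggests recursion; the effect (skipping exhausted or empty parents so no empty slice is emitted) should be the same.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a record batch: only the rows matter here.
using Rows = std::vector<int>;

// Hypothetical simplified generator (not the Arrow API).
struct SliceGenerator {
  const std::vector<Rows>* batches;
  int64_t batch_size;
  std::size_t batch_index = 0;
  int64_t offset = 0;

  // Returns the next non-empty slice, or std::nullopt once all parents are consumed.
  std::optional<Rows> operator()() {
    assert(batch_size > 0);
    // A parent is consumed when offset >= its row count. Advance past consumed
    // (or empty) parents, resetting offset each time we move on.
    while (batch_index < batches->size() &&
           offset >= static_cast<int64_t>((*batches)[batch_index].size())) {
      ++batch_index;
      offset = 0;
    }
    if (batch_index >= batches->size()) {
      return std::nullopt;
    }
    const Rows& parent = (*batches)[batch_index];
    const int64_t num_rows = static_cast<int64_t>(parent.size());
    const int64_t length = std::min(batch_size, num_rows - offset);
    // Slice first, then update the offset.
    Rows next(parent.begin() + offset, parent.begin() + offset + length);
    offset += length;
    return next;
  }
};
```

With parents of 5, 0, and 2 rows and a batch size of 2, this sketch yields slices of 2, 2, 1, and 2 rows and then signals the end, without ever emitting an empty slice.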
##########
File path: cpp/src/arrow/dataset/scanner_test.cc
##########
@@ -36,8 +36,20 @@ constexpr int64_t kNumberChildDatasets = 2;
 constexpr int64_t kNumberBatches = 16;
 constexpr int64_t kBatchSize = 1024;
 
-class TestScanner : public DatasetFixtureMixin {
+struct PrintIsAsyncParam {
+  std::string operator()(::testing::TestParamInfo<bool> info) {
+    if (info.param) {
+      return "async";
+    } else {
+      return "sync";
+    }
+  }
+};
+
+class TestScanner : public DatasetFixtureMixinWithParam<bool> {

Review comment:
ARROW-11797 uses the param to toggle UseThreads, so this will have to become a `std::pair<bool, bool>` (or really, just a custom struct) in the end.
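
As a rough illustration of the "custom struct" option, something along these lines might work. The struct name, field names, and helper methods below are all hypothetical and appear in neither PR; the sketch deliberately avoids any test-framework dependency so it stands alone.

```cpp
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical parameter struct; the name and fields are assumptions,
// not taken from this PR or from ARROW-11797.
struct TestScannerParams {
  bool use_async;
  bool use_threads;

  // Alphanumeric label, intended for use as a parameterized-test name suffix.
  std::string ToString() const {
    return std::string(use_async ? "Async" : "Sync") +
           (use_threads ? "Threaded" : "Serial");
  }

  // Enumerates all four (use_async, use_threads) combinations.
  static std::vector<TestScannerParams> Values() {
    std::vector<TestScannerParams> out;
    for (bool async_flag : {false, true}) {
      for (bool threads_flag : {false, true}) {
        out.push_back({async_flag, threads_flag});
      }
    }
    return out;
  }
};
```

Compared with `std::pair<bool, bool>`, named fields keep call sites readable (`param.use_threads` rather than `param.second`), and a name printer along the lines of `PrintIsAsyncParam` could presumably delegate to `ToString()`.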