lidavidm commented on a change in pull request #10008:
URL: https://github.com/apache/arrow/pull/10008#discussion_r612469861



##########
File path: cpp/src/arrow/dataset/dataset.cc
##########
@@ -95,6 +95,33 @@ Result<ScanTaskIterator> InMemoryFragment::Scan(std::shared_ptr<ScanOptions> opt
   return MakeMapIterator(fn, std::move(batches_it));
 }
 
+Result<RecordBatchGenerator> InMemoryFragment::ScanBatchesAsync(
+    const ScanOptions& options) {
+  struct Generator {
+    Future<std::shared_ptr<RecordBatch>> operator()() {
+      if (batch_index >= self->record_batches_.size()) {
+        return AsyncGeneratorEnd<std::shared_ptr<RecordBatch>>();
+      }
+      const auto& next_parent = self->record_batches_[batch_index];
+      if (offset + batch_size < next_parent->num_rows()) {
+        offset += batch_size;
+        auto next = next_parent->Slice(offset, batch_size);
+        return Future<std::shared_ptr<RecordBatch>>::MakeFinished(std::move(next));
+      }
+      batch_index++;
+      auto next = next_parent->Slice(offset, batch_size);
+      return Future<std::shared_ptr<RecordBatch>>::MakeFinished(std::move(next));

Review comment:
       A few things here:
   - Shouldn't `offset` be reset to 0 when we advance to the next batch?
   - The check for whether the current batch still has rows left should just be `offset < next_parent->num_rows()`, I think.
   - `next_parent->Slice` should be called before we update `offset`, so that the slice starts at the current position.
   - It might be easier to just recurse after advancing to the next batch, if we care about avoiding empty batches. Otherwise, we should update `offset` after the second `Slice` call too.
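
To make the points above concrete, here is a rough, standalone sketch of the control flow I have in mind. It is not Arrow code: `SliceRef` and `SliceGenerator` are hypothetical names, parent batches are modeled only by their row counts, and it returns `std::optional` synchronously instead of a `Future`. It also loops rather than recursing, which skips empty batches the same way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a sliced RecordBatch: which parent batch the
// slice came from, where it starts, and how many rows it covers.
struct SliceRef {
  std::size_t batch_index;
  int64_t offset;
  int64_t length;
};

// Hypothetical, synchronous stand-in for the Generator in the diff.
class SliceGenerator {
 public:
  SliceGenerator(std::vector<int64_t> batch_rows, int64_t batch_size)
      : batch_rows_(std::move(batch_rows)), batch_size_(batch_size) {
    assert(batch_size_ > 0);
  }

  // Returns the next slice, or std::nullopt once every batch is consumed.
  std::optional<SliceRef> Next() {
    // Loop past exhausted (or empty) parent batches so that we never
    // emit an empty slice.
    while (batch_index_ < batch_rows_.size()) {
      const int64_t num_rows = batch_rows_[batch_index_];
      if (offset_ < num_rows) {
        // Slice first, then advance the offset.
        const int64_t length = std::min(batch_size_, num_rows - offset_);
        SliceRef next{batch_index_, offset_, length};
        offset_ += batch_size_;
        return next;
      }
      // The current batch is consumed: advance and reset the offset.
      ++batch_index_;
      offset_ = 0;
    }
    return std::nullopt;
  }

 private:
  std::vector<int64_t> batch_rows_;
  int64_t batch_size_;
  std::size_t batch_index_ = 0;
  int64_t offset_ = 0;
};
```

In the real generator, the `SliceRef` construction would be the `next_parent->Slice(offset, batch_size)` call and the `std::nullopt` return would be `AsyncGeneratorEnd`.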

##########
File path: cpp/src/arrow/dataset/scanner_test.cc
##########
@@ -36,8 +36,20 @@ constexpr int64_t kNumberChildDatasets = 2;
 constexpr int64_t kNumberBatches = 16;
 constexpr int64_t kBatchSize = 1024;
 
-class TestScanner : public DatasetFixtureMixin {
+struct PrintIsAsyncParam {
+  std::string operator()(::testing::TestParamInfo<bool> info) {
+    if (info.param) {
+      return "async";
+    } else {
+      return "sync";
+    }
+  }
+};
+
+class TestScanner : public DatasetFixtureMixinWithParam<bool> {

Review comment:
       ARROW-11797 uses the param to toggle UseThreads, so this will have to become a `std::pair<bool, bool>` (or really, just a custom struct) in the end.
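
For illustration only, the custom struct could look roughly like the sketch below. `TestScannerParams`, its field names, and `ToString` are hypothetical and not taken from either PR; the idea is just that a named struct reads better than `std::pair<bool, bool>` and can produce its own test-name suffix.

```cpp
#include <string>

// Hypothetical parameter struct combining both toggles. The field names
// are illustrative, not the ones used in ARROW-11797.
struct TestScannerParams {
  bool use_async;
  bool use_threads;

  // Builds an alphanumeric suffix suitable for a parameterized test name.
  std::string ToString() const {
    std::string name = use_async ? "Async" : "Sync";
    name += use_threads ? "Threaded" : "Serial";
    return name;
  }
};
```

A name-generator functor like `PrintIsAsyncParam` could then simply return `info.param.ToString()`.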





