lidavidm commented on code in PR #14663:
URL: https://github.com/apache/arrow/pull/14663#discussion_r1035929873


##########
cpp/src/arrow/dataset/file_csv.cc:
##########
@@ -52,9 +53,99 @@ using internal::SerialExecutor;
 
 namespace dataset {
 
+struct CsvInspectedFragment : public InspectedFragment {
+  CsvInspectedFragment(std::vector<std::string> column_names,
+                       std::shared_ptr<io::InputStream> input_stream, int64_t num_bytes)
+      : InspectedFragment(std::move(column_names)),
+        input_stream(std::move(input_stream)),
+        num_bytes(num_bytes) {}
+  // We need to start reading the file in order to figure out the column names and
+  // so we save off the input stream
+  std::shared_ptr<io::InputStream> input_stream;
+  int64_t num_bytes;
+};
+
+class CsvFileScanner : public FragmentScanner {
+ public:
+  CsvFileScanner(std::shared_ptr<csv::StreamingReader> reader, int num_batches,
+                 int64_t best_guess_bytes_per_batch)
+      : reader_(std::move(reader)),
+        num_batches_(num_batches),
+        best_guess_bytes_per_batch_(best_guess_bytes_per_batch) {}
+
+  Future<std::shared_ptr<RecordBatch>> ScanBatch(int batch_number) override {
+    // This should be called in increasing order but let's verify that in case it changes.
+    // It would be easy enough to handle out of order but no need for that complexity at
+    // the moment.
+    DCHECK_EQ(scanned_so_far_++, batch_number);
+    return reader_->ReadNextAsync().Then(
+        [](const std::shared_ptr<RecordBatch>& batch) { return batch; });

Review Comment:
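   Hmm, isn't the `Then` here redundant? The callback returns the batch unchanged, so the result of `ReadNextAsync()` could presumably be returned directly.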
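
To illustrate the point about the identity continuation, here is a minimal, self-contained sketch. It does not use Arrow: `ToyFuture`, `ToyBatch`, and the free function `ReadNextAsync()` are hypothetical stand-ins invented for this example, and `ToyFuture` only models an already-completed future. The assumption carried over from the diff is that `ReadNextAsync()` already yields a future of a shared batch pointer, so chaining a callback that returns its argument produces the same type and the same value.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for a future type (NOT arrow::Future).
// It only models an already-completed future holding a value.
template <typename T>
class ToyFuture {
 public:
  explicit ToyFuture(T value) : value_(std::move(value)) {}

  // Applies `fn` to the stored result and wraps the return value in a new future.
  template <typename Fn>
  auto Then(Fn&& fn) const -> ToyFuture<decltype(fn(std::declval<const T&>()))> {
    return ToyFuture<decltype(fn(std::declval<const T&>()))>(fn(value_));
  }

  const T& result() const { return value_; }

 private:
  T value_;
};

// Hypothetical stand-in for a record batch.
struct ToyBatch {
  int num_rows;
};

// Hypothetical reader call: already returns a future of a shared batch pointer.
ToyFuture<std::shared_ptr<ToyBatch>> ReadNextAsync() {
  return ToyFuture<std::shared_ptr<ToyBatch>>(std::make_shared<ToyBatch>(ToyBatch{3}));
}

// Shape of the code under review: an identity callback chained with Then.
ToyFuture<std::shared_ptr<ToyBatch>> ScanWithThen() {
  return ReadNextAsync().Then(
      [](const std::shared_ptr<ToyBatch>& batch) { return batch; });
}

// Simplified form: return the reader's future directly.
ToyFuture<std::shared_ptr<ToyBatch>> ScanDirect() { return ReadNextAsync(); }
```

Both functions have the same return type and hand back an equivalent batch, which is why the extra `Then` looks unnecessary. Whether the real `Future` type has any behavior that makes the extra hop meaningful is something the PR author would need to confirm.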


