[GitHub] [arrow] westonpace commented on a change in pull request #9589: ARROW-11797: [C++][Dataset] Provide batch stream Scanner methods

GitBox Wed, 07 Apr 2021 15:57:50 -0700


westonpace commented on a change in pull request #9589:
URL: https://github.com/apache/arrow/pull/9589#discussion_r609122759




##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -165,6 +166,9 @@ class ARROW_DS_EXPORT Scanner {
   /// Scan result in memory before creating the Table.
   Result<std::shared_ptr<Table>> ToTable();
 
+  /// \brief ToBatches returns an iterator over all Batches yielded by this scan.
+  Result<RecordBatchIterator> ToBatches();

Review comment:
       Hmm, the issue I ran into was that `ScanBatches` was used by 
`FileSystemDataset::Write` and it needed the fragment info in order to have 
access to the fragment's partition expression.  So at a minimum I needed to 
return "record batch & partition it came from".
   
   I think there was some discussion (either on the ML or some JIRA/PR) about 
the benefit of keeping the fragment available as the user might want to know 
where the batch came from.
   
   Can you modify the `ScanBatches` here to return a RecordBatch/Fragment pair? 
 I could align my `ScanBatches` with that (`PositionedRecordBatch` is overkill).
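
   A minimal sketch of what such a RecordBatch/Fragment pair could look like, so that a consumer like `FileSystemDataset::Write` can still reach the partition expression. Everything below is an assumption for illustration: `RecordBatch` and `Fragment` are simplified stand-ins rather than the Arrow classes, and `BatchWithFragment`, `FragmentInput`, `ScanBatchesSketch`, and `RowsPerPartition` are hypothetical names that do not come from this PR.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Stand-ins for arrow::RecordBatch and arrow::dataset::Fragment so the sketch
// compiles without Arrow; the real classes are far richer.
struct RecordBatch {
  int num_rows;
};

struct Fragment {
  // Stand-in for the fragment's partition expression, kept as a string here.
  std::string partition_expression;
};

// Hypothetical pair type: a batch plus the fragment it was read from.
struct BatchWithFragment {
  std::shared_ptr<RecordBatch> record_batch;
  std::shared_ptr<Fragment> fragment;
};

// Hypothetical scan input: a fragment and the row counts of its batches.
struct FragmentInput {
  std::shared_ptr<Fragment> fragment;
  std::vector<int> batch_row_counts;
};

// Hypothetical scan: yields every batch tagged with its originating fragment.
std::vector<BatchWithFragment> ScanBatchesSketch(
    const std::vector<FragmentInput>& inputs) {
  std::vector<BatchWithFragment> out;
  for (const auto& input : inputs) {
    for (int rows : input.batch_row_counts) {
      out.push_back({std::make_shared<RecordBatch>(RecordBatch{rows}),
                     input.fragment});
    }
  }
  return out;
}

// A writer-like consumer: groups rows by the partition expression of the
// fragment each batch came from, which a bare RecordBatch could not provide.
std::map<std::string, int> RowsPerPartition(
    const std::vector<BatchWithFragment>& batches) {
  std::map<std::string, int> rows;
  for (const auto& tagged : batches) {
    rows[tagged.fragment->partition_expression] += tagged.record_batch->num_rows;
  }
  return rows;
}
```

   The pair carries only a shared pointer to the fragment, with no positional information, which is the lighter-weight alternative to `PositionedRecordBatch` suggested above.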




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
