westonpace commented on a change in pull request #9589:
URL: https://github.com/apache/arrow/pull/9589#discussion_r609122759
##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -165,6 +166,9 @@ class ARROW_DS_EXPORT Scanner {
/// Scan result in memory before creating the Table.
Result<std::shared_ptr<Table>> ToTable();
+ /// \brief ToBatches returns an iterator over all Batches yielded by this scan.
+ Result<RecordBatchIterator> ToBatches();
Review comment:
Hmm, the issue I ran into was that `ScanBatches` was used by
`FileSystemDataset::Write` and it needed the fragment info in order to have
access to the fragment's partition expression. So at a minimum I needed to
return "record batch & partition it came from".
I think there was some discussion (either on the mailing list or in some
JIRA/PR) about the benefit of keeping the fragment available, since the user
might want to know where a batch came from.
Can you modify the `ScanBatches` here to return a RecordBatch/Fragment pair?
I could align my `ScanBatches` with that (`PositionedRecordBatch` is overkill).
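To make the request concrete, here is a minimal sketch of what a RecordBatch/Fragment pair could look like. It uses stand-in `RecordBatch` and `Fragment` structs rather than the real Arrow classes, and the names `BatchWithFragment`, `FragmentBatches`, `ScanBatchesSketch`, and `RowsPerPartition` are hypothetical, chosen only for illustration. It is not the API proposed in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in types for illustration only; NOT the real arrow:: classes.
struct RecordBatch {
  std::int64_t num_rows;
};
struct Fragment {
  std::string partition_expression;  // e.g. "year == 2020"
};

// Hypothetical pair: a batch plus the fragment it was read from.
struct BatchWithFragment {
  std::shared_ptr<RecordBatch> batch;
  std::shared_ptr<Fragment> fragment;
};

// Hypothetical input shape: one fragment and the batches it yields.
using FragmentBatches =
    std::pair<std::shared_ptr<Fragment>,
              std::vector<std::shared_ptr<RecordBatch>>>;

// Hypothetical scan: flattens per-fragment batches, tagging each batch with
// its source fragment so downstream code keeps that information.
std::vector<BatchWithFragment> ScanBatchesSketch(
    const std::vector<FragmentBatches>& input) {
  std::vector<BatchWithFragment> out;
  for (const auto& entry : input) {
    assert(entry.first != nullptr);  // every batch must have a source fragment
    for (const auto& batch : entry.second) {
      out.push_back(BatchWithFragment{batch, entry.first});
    }
  }
  return out;
}

// Writer-like consumer: reads the fragment's partition expression for each
// batch (here it only sums row counts per partition).
std::map<std::string, std::int64_t> RowsPerPartition(
    const std::vector<BatchWithFragment>& tagged) {
  std::map<std::string, std::int64_t> rows;
  for (const auto& item : tagged) {
    rows[item.fragment->partition_expression] += item.batch->num_rows;
  }
  return rows;
}
```

With a plain pair like this, a consumer such as `FileSystemDataset::Write` would only need to read the fragment member to reach the partition expression, without carrying any extra positional metadata.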