westonpace commented on a change in pull request #11294:
URL: https://github.com/apache/arrow/pull/11294#discussion_r731323572
##########
File path: cpp/src/arrow/dataset/scanner.cc
##########
@@ -1175,8 +1179,15 @@ Result<compute::ExecNode*>
MakeScanNode(compute::ExecPlan* plan,
ARROW_ASSIGN_OR_RAISE(auto batch_gen_gen,
FragmentsToBatches(std::move(fragment_gen),
scan_options));
- auto merged_batch_gen =
- MakeMergedGenerator(std::move(batch_gen_gen),
scan_options->fragment_readahead);
+ AsyncGenerator<EnumeratedRecordBatch> merged_batch_gen;
+ if (require_sequenced_output) {
+ ARROW_ASSIGN_OR_RAISE(merged_batch_gen,
+
MakeSequencedMergedGenerator(std::move(batch_gen_gen),
+
scan_options->fragment_readahead));
Review comment:
No, it is not redundant. We still need to resequence at the end
(currently MakeSequencingGenerator is used after the plan) because there is no
guarantee operations like filter or project will maintain the ordering.
Although, we could introduce a sequenced sink which would allow us to get rid
of MakeSequencingGenerator. We can tackle that when we tackle general
sequencing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]