[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman resolved ARROW-8447.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 7075
[https://github.com/apache/arrow/pull/7075]

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8447
>                 URL: https://issues.apache.org/jira/browse/ARROW-8447
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This can be refactored with a little effort in Scanner::ToTable:
> # Change `batches` to `std::vector<RecordBatchVector>`
> # When pushing the closure to the TaskGroup, also track an incrementing
> integer, e.g. scan_task_id
> # In the closure, store the RecordBatches for this ScanTask in a local
> vector; when all batches are consumed, move the local vector into `batches`
> at the right index, resizing and emplacing under a mutex
> # After waiting for the task group to complete, either
> * Flatten into a single vector and call `Table::FromRecordBatches`, or
> * Write a RecordBatchReader that supports vector<vector<RecordBatch>> and
> add a method `Table::FromRecordBatchReader`
> The latter involves more work but is the clean way; the other
> FromRecordBatches method can be implemented on top of it and support
> "streaming".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
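
The four steps in the quoted description can be sketched roughly as below. This is a minimal, standalone illustration and not the code merged in pull request 7075: it assumes plain std::thread workers in place of Arrow's TaskGroup, and std::string in place of RecordBatch. The names Batch, BatchVector and CollectInOrder are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a "batch" is a string, a "scan task" yields a list of them.
using Batch = std::string;
using BatchVector = std::vector<Batch>;

// Runs each scan task on its own thread and returns all batches in scan-task
// order, regardless of the order in which the threads finish.
std::vector<Batch> CollectInOrder(const std::vector<BatchVector>& scan_tasks) {
  std::vector<BatchVector> batches;  // step 1: one slot per scan task
  std::mutex mutex;
  std::vector<std::thread> workers;

  // Step 2: each closure carries its own incrementing id.
  for (std::size_t scan_task_id = 0; scan_task_id < scan_tasks.size(); ++scan_task_id) {
    workers.emplace_back([&batches, &mutex, &scan_tasks, scan_task_id] {
      // Step 3: consume the task into a local vector without holding the lock.
      BatchVector local;
      for (const Batch& batch : scan_tasks[scan_task_id]) local.push_back(batch);

      // Publish at the right index, resizing under the mutex if needed.
      std::lock_guard<std::mutex> lock(mutex);
      if (batches.size() <= scan_task_id) batches.resize(scan_task_id + 1);
      batches[scan_task_id] = std::move(local);
    });
  }

  // Step 4: wait for completion, then flatten into a single vector.
  for (std::thread& worker : workers) worker.join();

  std::vector<Batch> flat;
  for (BatchVector& per_task : batches) {
    for (Batch& batch : per_task) flat.push_back(std::move(batch));
  }
  return flat;
}
```

Because every task writes only to the slot matching its id, the flattened result follows the order in which the tasks were submitted even when a later task finishes first. The lock is held only for the resize and the move, so the per-task work itself stays parallel.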