jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122058756
What I observed from trying out this branch is that it does not preserve:
```
In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
In [20]: table = pa.concat_tables([table1, table2])
In [21]: ep._filter_table(table, pc.field('a') == 1)
Out[21]:
pyarrow.Table
a: int64
b: string
----
a: [[1],[1]]
b: [["b"],["a"]]
In [22]: ep._filter_table(table, pc.field('a') == 1)
Out[22]:
pyarrow.Table
a: int64
b: string
----
a: [[1],[1]]
b: [["a"],["b"]]
```
But the current cython wrappers of the ExecPlan are not using a "to_table"
method, it is using `Table::FromRecordBatchReader`, with a record batch
generator created using `MakeGeneratorReader`, which explicitly says it does
not preserve order:
https://github.com/apache/arrow/blob/5a4b3db396f88898917620de50863b6d3a477a7a/cpp/src/arrow/compute/exec/exec_plan.h#L440-L447
I see that we currently use a "sink" node as final output node, but there is
also a "table_sink" node. That might simplify the code a bit to get a Table
(without manually creating a record batch reader and creating a table from
that), but don't think that would help with the ordering? (at least in the
code I also don't see any ordering handling for this sink)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]