[GitHub] [arrow] jorisvandenbossche commented on pull request #13075: ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

GitBox Tue, 10 May 2022 01:01:04 -0700


jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122058756


   What I observed from trying out this branch is that it does not preserve the order of the rows, and the result can differ between identical calls:
   
   ```
   In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
   
   In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
   
   In [20]: table = pa.concat_tables([table1, table2])
   
   In [21]: ep._filter_table(table, pc.field('a') == 1)
   Out[21]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1],[1]]
   b: [["b"],["a"]]
   
   In [22]: ep._filter_table(table, pc.field('a') == 1)
   Out[22]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1],[1]]
   b: [["a"],["b"]]
   ```
   
   But the current Cython wrappers of the ExecPlan do not use a "to_table" 
method. They use `Table::FromRecordBatchReader`, with a record batch 
generator created using `MakeGeneratorReader`, which explicitly says it does 
not preserve order:
   
   
https://github.com/apache/arrow/blob/5a4b3db396f88898917620de50863b6d3a477a7a/cpp/src/arrow/compute/exec/exec_plan.h#L440-L447
   
   I see that we currently use a "sink" node as the final output node, but there is 
also a "table_sink" node. That might simplify the code a bit to get a Table 
(without manually creating a record batch reader and creating a table from 
that), but I don't think that would help with the ordering? (At least in the 
code, I also don't see any ordering handling for this sink.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
