[ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197685#comment-17197685 ]
Troy Zimmerman commented on ARROW-10027:
----------------------------------------

[~jorisvandenbossche] Thank you for the quick & detailed response. I'll take a closer look at the core that is dumped to see if I can narrow down what's causing the crash since it just seems to be on my end.

> [Python] Incorrect null column returned when using a dataset filter expression.
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-10027
>                 URL: https://issues.apache.org/jira/browse/ARROW-10027
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Troy Zimmerman
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, entire
> {{null}} columns are returned, rather than rows that matched other columns in
> the filter.
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>     ...:     arrays=[
>     ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>     ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
>     ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
>     ...:     ],
>     ...:     names=["id", "name", "other"],
>     ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>    id   name other
> 0   0   zero  None
> 1   1    one  None
> 2   2    two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6    six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I
> categorized it as a Python issue. Let me know if that's wrong and I'll note
> that for the future.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
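
The length mismatch described in the quoted report can be checked mechanically. The following is a minimal sketch, not part of the original issue: `column_lengths` and `mismatched_columns` are hypothetical helper names, not pyarrow APIs, and they assume a plain dictionary of lists of the kind shown in `Out[17]`. The sample data is copied from that output.

```python
from collections import Counter


def column_lengths(columns):
    """Map each column name to the number of values it holds."""
    return {name: len(values) for name, values in columns.items()}


def mismatched_columns(columns):
    """Return the columns whose length differs from the most common length."""
    lengths = column_lengths(columns)
    if not lengths:
        return {}
    # Treat the length shared by the most columns as the expected row count.
    expected = Counter(lengths.values()).most_common(1)[0][0]
    return {name: n for name, n in lengths.items() if n != expected}


# The dictionary reported in Out[17] of the issue.
reported = {
    "id": [1, 4, 7],
    "name": ["one", "four", "seven"],
    "other": [None] * 10,
}

print(column_lengths(reported))      # {'id': 3, 'name': 3, 'other': 10}
print(mismatched_columns(reported))  # {'other': 10}
```

Running a check like this before converting a filtered table to pandas would probably flag the malformed `other` column first, though that is an assumption about how one might use it rather than a fix for the underlying bug.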