[jira] [Commented] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

Troy Zimmerman (Jira) Thu, 17 Sep 2020 06:42:42 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197685#comment-17197685
 ]


Troy Zimmerman commented on ARROW-10027:
----------------------------------------

[~jorisvandenbossche] Thank you for the quick & detailed response.

I'll take a closer look at the core dump to see if I can narrow down what's 
causing the crash, since it seems to happen only on my end.

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-10027
>                 URL: https://issues.apache.org/jira/browse/ARROW-10027
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Troy Zimmerman
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using dataset filter expressions (which I <3) with Parquet files, columns 
> of type {{null}} are returned in full, rather than only the rows that match 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...:     arrays=[
>  ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six",
>  ...:                   "seven", "eight", "nine"]),
>  ...:         pa.array([None, None, None, None, None, None, None, None, None,
>  ...:                   None]),
>  ...:     ],
>  ...:     names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>    id   name other
> 0   0   zero  None
> 1   1    one  None
> 2   2    two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6    six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.
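
The length mismatch described above can also be checked programmatically. The following is only a minimal sketch: {{mismatched_columns}} is a hypothetical helper (not part of pyarrow), it works on the plain dict returned by {{to_pydict}} so it does not need pyarrow itself, and treating the most common column length as the "expected" one is an assumption made for illustration.

```python
from collections import Counter


def mismatched_columns(columns):
    """Return {name: length} for columns whose length differs from the
    most common column length (hypothetical helper, not a pyarrow API)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return {}
    # Assumption: the length shared by most columns is the correct one.
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {name: n for name, n in lengths.items() if n != expected}


# Dict copied from Out[17] in the report above.
filtered = {
    "id": [1, 4, 7],
    "name": ["one", "four", "seven"],
    "other": [None] * 10,
}

print(mismatched_columns(filtered))  # {'other': 10}
```

With a real filtered table, the same check would presumably be {{mismatched_columns(table.to_pydict())}}; an empty result would indicate that every column was filtered consistently.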



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
