[ 
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197535#comment-17197535
 ] 

Joris Van den Bossche commented on ARROW-10027:
-----------------------------------------------

So it seems this bug is not in the Dataset code itself but in the filter 
operation. Manually filtering a RecordBatch also returns an incorrect result: 
the null column is left unfiltered.

{code}
import pyarrow as pa

table = pa.Table.from_arrays(
    arrays=[
        pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
        pa.array(["zero", "one", "two", "three", "four", "five", "six",
                  "seven", "eight", "nine"]),
        pa.array([None, None, None, None, None, None, None, None, None, None]),
    ],
    names=["id", "name", "other"],
)

batch = table.to_batches()[0]
{code}

{code}
In [32]: batch
Out[32]: 
pyarrow.RecordBatch
id: int64
name: string
other: null

In [33]: batch.num_rows
Out[33]: 10

In [34]: filtered_batch = batch.filter(pa.array([True, False]*5))

In [35]: filtered_batch.num_rows
Out[35]: 5

In [36]: filtered_batch.column(2)
Out[36]: 
<pyarrow.lib.NullArray object at 0x7fdf9c4002e8>
10 nulls

In [37]: len(filtered_batch.column(2))
Out[37]: 10
{code}


Filtering the Array, the ChunkedArray, or the Table directly does seem to 
work, though:

{code}
In [38]: filtered_table = table.filter(pa.array([True, False]*5))

In [39]: filtered_table.num_rows
Out[39]: 5

In [40]: filtered_table['other']
Out[40]: 
<pyarrow.lib.ChunkedArray object at 0x7fdf9c3b9938>
[
5 nulls
]

In [41]: chunked_array = table['other']

In [42]: chunked_array
Out[42]: 
<pyarrow.lib.ChunkedArray object at 0x7fdf9c391410>
[
10 nulls
]

In [43]: chunked_array.filter(pa.array([True, False]*5))
Out[43]: 
<pyarrow.lib.ChunkedArray object at 0x7fdf9c352c50>
[
5 nulls
]

In [44]: chunked_array.chunks[0].filter(pa.array([True, False]*5))
Out[44]: 
<pyarrow.lib.NullArray object at 0x7fdf9c362e88>
5 nulls

{code}
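
To compare the two paths above without eyeballing the reprs, a small check 
like the following could flag any column whose length disagrees with 
{{num_rows}}. This is only a sketch: {{mismatched_columns}} is a hypothetical 
helper name, not part of pyarrow, and it relies solely on the {{num_rows}}, 
{{schema.names}} and {{column(i)}} accessors already used in this comment, so 
it should accept either a RecordBatch or a Table.

```python
def mismatched_columns(batch):
    """Map column name -> length for columns whose length differs from num_rows.

    Hypothetical helper (not a pyarrow API). `batch` is assumed to expose
    `num_rows`, `schema.names` and `column(i)`, as pyarrow RecordBatch and
    Table objects do.
    """
    bad = {}
    for i, name in enumerate(batch.schema.names):
        length = len(batch.column(i))
        if length != batch.num_rows:
            bad[name] = length
    return bad


# Intended usage with the objects from the sessions above:
#   mismatched_columns(filtered_batch)   # RecordBatch.filter path
#   mismatched_columns(filtered_table)   # Table.filter path
# An empty dict means every column has the same length as num_rows.
```

Given the session output above, the RecordBatch result would be expected to 
report the {{other}} column and the Table result to come back empty.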

> [Python] Incorrect null column returned when using a dataset filter 
> expression.
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-10027
>                 URL: https://issues.apache.org/jira/browse/ARROW-10027
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Troy Zimmerman
>            Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire 
> {{null}} columns are returned, rather than rows that matched other columns in 
> the filter. 
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...:     arrays=[
>  ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
>  ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
>  ...:     ],
>  ...:     names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>    id   name other
> 0   0   zero  None
> 1   1    one  None
> 2   2    two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6    six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and 
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I 
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I 
> categorized it as a Python issue. Let me know if that's wrong and I'll note 
> that for the future.



