[
https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197535#comment-17197535
]
Joris Van den Bossche commented on ARROW-10027:
-----------------------------------------------
So it seems this bug is not in the Dataset code itself, but in the filter
operation. Manually filtering a RecordBatch also returns an incorrect batch,
in which the null column is left unfiltered:
{code}
import pyarrow as pa

# Three columns of 10 rows each; "other" is an all-null column (type: null).
table = pa.Table.from_arrays(
    arrays=[
        pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
        pa.array(["zero", "one", "two", "three", "four", "five", "six",
                  "seven", "eight", "nine"]),
        pa.array([None, None, None, None, None, None, None, None, None, None]),
    ],
    names=["id", "name", "other"],
)
batch = table.to_batches()[0]
{code}
{code}
In [32]: batch
Out[32]:
pyarrow.RecordBatch
id: int64
name: string
other: null
In [33]: batch.num_rows
Out[33]: 10
In [34]: filtered_batch = batch.filter(pa.array([True, False]*5))
In [35]: filtered_batch.num_rows
Out[35]: 5
In [36]: filtered_batch.column(2)
Out[36]:
<pyarrow.lib.NullArray object at 0x7fdf9c4002e8>
10 nulls
In [37]: len(filtered_batch.column(2))
Out[37]: 10
{code}
Filtering the array, the chunked array, or the Table directly seems to work,
though:
{code}
In [38]: filtered_table = table.filter(pa.array([True, False]*5))
In [39]: filtered_table.num_rows
Out[39]: 5
In [40]: filtered_table['other']
Out[40]:
<pyarrow.lib.ChunkedArray object at 0x7fdf9c3b9938>
[
5 nulls
]
In [41]: chunked_array = table['other']
In [42]: chunked_array
Out[42]:
<pyarrow.lib.ChunkedArray object at 0x7fdf9c391410>
[
10 nulls
]
In [43]: chunked_array.filter(pa.array([True, False]*5))
Out[43]:
<pyarrow.lib.ChunkedArray object at 0x7fdf9c352c50>
[
5 nulls
]
In [44]: chunked_array.chunks[0].filter(pa.array([True, False]*5))
Out[44]:
<pyarrow.lib.NullArray object at 0x7fdf9c362e88>
5 nulls
{code}
> [Python] Incorrect null column returned when using a dataset filter
> expression.
> -------------------------------------------------------------------------------
>
> Key: ARROW-10027
> URL: https://issues.apache.org/jira/browse/ARROW-10027
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Troy Zimmerman
> Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, entire
> {{null}} columns are returned, rather than rows that matched other columns in
> the filter.
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
> ...: arrays=[
> ...: pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
> ...: pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
> ...: pa.array([None, None, None, None, None, None, None, None, None, None]),
> ...: ],
> ...: names=["id", "name", "other"],
> ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
> id name other
> 0 0 zero None
> 1 1 one None
> 2 2 two None
> 3 3 three None
> 4 4 four None
> 5 5 five None
> 6 6 six None
> 7 7 seven None
> 8 8 eight None
> 9 9 nine None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
> 'name': ['one', 'four', 'seven'],
> 'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and
> {{name}} columns have 3 elements, but the {{other}} column has all 10. When I
> call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I
> categorized it as a Python issue. Let me know if that's wrong and I'll note
> that for the future.