AlenkaF commented on issue #49392:
URL: https://github.com/apache/arrow/issues/49392#issuecomment-4135599337
Thank you for opening the issue! This looks like a serious bug to me.
From testing with PyArrow and investigating with Copilot, it looks like
the overflow happens in the offset arithmetic.
Take, for example, a table with a `large_list` column instead of a `list` column:
```python
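>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> # ids, texts, numbers and N are assumed to be defined as in the issue's reproducer
>>> my_schema = pa.schema([
...     pa.field('id', pa.int64()),
...     pa.field('text', pa.string()),
...     pa.field('numbers', pa.large_list(pa.float64()))])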
>>> tbl = pa.table([ids, texts, numbers], schema=my_schema)
>>> tbl.filter(pc.field("id") == N - 1)
pyarrow.Table
id: int64
text: string
numbers: large_list<item: double>
  child 0, item: double
----
id: [[],[],...,[],[499999]]
text: [[],[],...,[],["Row 499999 with data"]]
numbers:
[[],[],...,[],[[499999,0.2806802660498191,0.18948458094650322,0.6611584406407851,0.340530752637791,...,0.19918275933231844,0.42906946186903017,0.49644347191463034,0.3171420306034032,0.13584405454197468]]]
```
Here the filter returns the expected row, so this bug would need a fix in the C++ filter kernel.
For now, I think defining the column as `large_list` in the schema, or casting
the `list` column to `large_list`, can be used as a workaround.