Re: [I] Filtering corrupts data in column containing an array [arrow]

via GitHub Thu, 26 Mar 2026 07:49:46 -0700


AlenkaF commented on issue #49392:
URL: https://github.com/apache/arrow/issues/49392#issuecomment-4135599337


   Thank you for opening the issue! This looks like a serious bug to me.
   From testing with PyArrow and investigating with Copilot, it looks like an
   overflow happens in the offset arithmetic for the `list` column.
   
   Take, for example, a table with a `large_list` column instead of a `list` column:
   
   ```python
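   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> # ids, texts, numbers and N are defined as in the issue's reproduction
   >>> my_schema = pa.schema([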
   ...     pa.field('id', pa.int64()),
   ...     pa.field('text', pa.string()),
   ...     pa.field('numbers', pa.large_list(pa.float64()))])
   >>> tbl = pa.table([ids, texts, numbers], schema=my_schema)
   
   >>> tbl.filter(pc.field("id") == N - 1)
   pyarrow.Table
   id: int64
   text: string
   numbers: large_list<item: double>
     child 0, item: double
   ----
   id: [[],[],...,[],[499999]]
   text: [[],[],...,[],["Row 499999 with data"]]
   numbers: [[],[],...,[],[[499999,0.2806802660498191,0.18948458094650322,0.6611584406407851,0.340530752637791,...,0.19918275933231844,0.42906946186903017,0.49644347191463034,0.3171420306034032,0.13584405454197468]]]
   ```
   
   With `large_list`, which uses 64-bit offsets, the filtered row comes back
   intact, so this bug would need a fix in the C++ filter kernel.
   
   For now, I think either defining the column as `large_list` in the schema
   or casting the `list` column to `large_list` before filtering can be used
   as a workaround.


