adrien-grl opened a new issue, #50012:
URL: https://github.com/apache/arrow/issues/50012

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   **Bug description**
   
   When `pa.Table.from_pylist` is given a schema with a `pa.ExtensionType` 
column whose storage type contains a `pa.list_` field, and the total number of 
values in that list field across all rows exceeds int32 max (2**31 - 1), the 
call fails with:
    
   ```
   TypeError: Argument 'storage' has incorrect type (expected 
pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)
   ```
   
   The message gives no indication of the actual cause of the issue 
(for instance, that it originates from a `pa.list_` or a `pa.ExtensionType`).
   
   **Environment**
   - PyArrow 24.0.0
   - Python 3.12, Linux x86_64
     
   **Minimal steps to reproduce**
   
   The code requires roughly 3GB RAM.
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   class FooExt(pa.ExtensionType):
       def __init__(self):
           super().__init__(
               pa.struct({"data": pa.list_(pa.uint8())}),
               "foo_img",
           )
   
       def __arrow_ext_serialize__(self):
           return b""
   
       @classmethod
       def __arrow_ext_deserialize__(cls, storage_type, serialized):
           return cls()
   
   pa.register_extension_type(FooExt())
   
   schema = pa.schema({"img": FooExt()})
   
   # 5 rows × 500M values = 2.5B > int32 max
   arr = np.zeros(500_000_000, dtype=np.uint8)
   rows = [{"img": {"data": arr}} for _ in range(5)]
   
   pa.Table.from_pylist(rows, schema=schema)
   # TypeError: Argument 'storage' has incorrect type
   #            (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)
   ```
   
   **Expected behavior**
   
   Either:
   1. An actionable error that names the column, identifies the int32-offset 
cause, and maybe even points at the escape routes (`pa.large_list`, smaller 
batches, or manual chunked construction), or
   2. A successful build that returns a `ChunkedArray<ExtensionArray>` whose 
chunks each fit in int32 offsets. 
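
To make the "smaller batches" escape route concrete, here is a minimal sketch in plain Python (no PyArrow calls) of how rows could be grouped so that each group's total list length stays within int32 offsets. The helper name `plan_batches` and its `limit` parameter are hypothetical and not part of PyArrow. Each group could then presumably be passed to `pa.Table.from_pylist` on its own and the results combined with `pa.concat_tables`; that last step is an assumption and has not been verified at this scale.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit list offset buffer can hold


def plan_batches(list_lengths, limit=INT32_MAX):
    """Group consecutive row indices so each group's summed length is <= limit.

    Hypothetical helper for illustration only; not a PyArrow API.
    """
    batches, current, total = [], [], 0
    for i, n in enumerate(list_lengths):
        if n > limit:
            # A single row this large cannot fit in int32 offsets at all.
            raise ValueError(f"row {i} has {n} values, above the limit {limit}")
        if current and total + n > limit:
            # Adding this row would overflow, so close the current group.
            batches.append(current)
            current, total = [], 0
        current.append(i)
        total += n
    if current:
        batches.append(current)
    return batches


lengths = [500_000_000] * 5       # same shape as the reproduction above
print(sum(lengths) > INT32_MAX)   # True: 2.5B values overflow int32 offsets
print(plan_batches(lengths))      # [[0, 1, 2, 3], [4]]
```

With the numbers from the reproduction, four rows (2.0B values) still fit, and the fifth row starts a new group.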
   
   ### Component(s)
   
   Python

