pitrou commented on code in PR #40818:
URL: https://github.com/apache/arrow/pull/40818#discussion_r1540723914


##########
python/pyarrow/types.pxi:
##########
@@ -3504,6 +3504,14 @@ def field(name, type, bint nullable=True, metadata=None):
     >>> pa.struct([field])
     StructType(struct<key: int32>)
     """
+    if hasattr(name, "__arrow_c_schema__"):
+        if not (type is None and metadata is None):
+            raise ValueError(

Review Comment:
   Make this `TypeError`?

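The convention the reviewer is pointing at can be illustrated with a standalone sketch (not pyarrow's actual implementation; `make_field` and `SchemaLike` are hypothetical names): passing arguments that are invalid *in combination* is a misuse of the call signature, which Python conventionally reports as `TypeError`, reserving `ValueError` for an argument of the right kind but unacceptable value.

```python
# Standalone sketch of the dispatch under review; make_field and its
# return values are illustrative, not pyarrow's real implementation.
def make_field(name, type=None, metadata=None):
    if hasattr(name, "__arrow_c_schema__"):
        if type is not None or metadata is not None:
            # Misuse of the call signature -> TypeError by convention.
            raise TypeError(
                "cannot specify 'type' or 'metadata' when 'name' "
                "already implements __arrow_c_schema__"
            )
        return ("from_schema", name.__arrow_c_schema__())
    return ("from_parts", name, type, metadata)


class SchemaLike:
    # Minimal stand-in for an object exporting the C schema protocol.
    def __arrow_c_schema__(self):
        return "capsule-placeholder"
```

With this shape, `make_field(SchemaLike(), type="int32")` raises `TypeError`, while a plain `make_field("x", type="int32")` goes through the normal path.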


##########
python/pyarrow/table.pxi:
##########
@@ -1437,6 +1451,21 @@ def chunked_array(arrays, type=None):
 
     if isinstance(arrays, Array):
         arrays = [arrays]
+    elif hasattr(arrays, "__arrow_c_stream__"):
+        if type is not None:
+            requested_type = type.__arrow_c_schema__()
+        else:
+            requested_type = None
+        capsule = arrays.__arrow_c_stream__(requested_type)
+        result = ChunkedArray._import_from_c_capsule(capsule)
+        if type is not None and result.type != type:

Review Comment:
   Why not use `requested_type` here?



##########
docs/source/python/extending_types.rst:
##########
@@ -37,14 +37,14 @@ under the hood, you can implement the following methods on those objects:
 
 - ``__arrow_c_schema__`` for schema or type-like objects.
 - ``__arrow_c_array__`` for arrays and record batches (contiguous tables).
-- ``__arrow_c_stream__`` for chunked tables or streams of data.
+- ``__arrow_c_stream__`` for chunked arrays or tables, or streams of data.

Review Comment:
   ```suggestion
   - ``__arrow_c_stream__`` for chunked arrays, tables and streams of data.
   ```



##########
python/pyarrow/tests/test_table.py:
##########
@@ -493,6 +493,47 @@ def test_recordbatch_dunder_init():
         pa.RecordBatch()
 
 
+def test_chunked_array_c_array_interface():
+    class ArrayWrapper:
+        def __init__(self, array):
+            self.array = array
+
+        def __arrow_c_array__(self, requested_schema=None):
+            return self.array.__arrow_c_array__(requested_schema)
+
+    data = pa.array([1, 2, 3], pa.int64())
+    chunked = pa.chunked_array([data])
+    wrapper = ArrayWrapper(data)
+
+    # Can roundtrip through the wrapper.
+    result = pa.chunked_array(wrapper)
+    assert result == chunked
+
+    # Can also import with a type that implementer can cast to.
+    result = pa.chunked_array(wrapper, type=pa.int16())

Review Comment:
   Also test with a `TypeWrapper` that would wrap the `pa.int16` type and only expose `__arrow_c_schema__`?



##########
python/pyarrow/tests/test_cffi.py:
##########
@@ -692,8 +692,14 @@ def test_roundtrip_chunked_array_capsule_requested_schema():
     imported_chunked = pa.ChunkedArray._import_from_c_capsule(capsule)
     assert imported_chunked == chunked
 
-    # Casting to something else should error
+    # Casting to something else should error if not possible
     requested_type = pa.binary()
     requested_capsule = requested_type.__arrow_c_schema__()
-    with pytest.raises(NotImplementedError):
+    capsule = chunked.__arrow_c_stream__(requested_capsule)
+    imported_chunked = pa.ChunkedArray._import_from_c_capsule(capsule)
+    assert imported_chunked == chunked.cast(pa.binary())
+
+    requested_type = pa.int64()
+    requested_capsule = requested_type.__arrow_c_schema__()
+    with pytest.raises(ValueError):

Review Comment:
   Check the error message as well?



##########
python/pyarrow/tests/test_types.py:
##########
@@ -1335,3 +1335,25 @@ def __arrow_c_schema__(self):
     wrapped_schema = Wrapper(schema)
 
     assert pa.schema(wrapped_schema) == schema
+
+
+def test_field_import_c_schema_interface():
+    class Wrapper:
+        def __init__(self, field):
+            self.field = field
+
+        def __arrow_c_schema__(self):
+            return self.field.__arrow_c_schema__()
+
+    field = pa.field("field_name", pa.int32(), metadata={"key": "value"})
+    wrapped_field = Wrapper(field)
+
+    assert pa.field(wrapped_field) == field
+
+    with pytest.raises(ValueError, match="cannot specify 'type'"):

Review Comment:
   Should be `TypeError`?



##########
docs/source/python/extending_types.rst:
##########
@@ -53,6 +53,31 @@ support for this protocol by checking for the presence of those methods, and
 therefore accept any Arrow data (instead of hardcoding support for a specific
 Arrow producer such as PyArrow).
 
+For consuming data through this protocol with PyArrow, the following constructors
+can be used to create the various PyArrow objects:
+
++----------------------------+-----------------------------------------------+--------------------+
+| Result class               | Mapped Arrow type                             | Supported protocol |

Review Comment:
   "Mapped Arrow type" as a column heading is a bit strange as the listed APIs are not types at all.



##########
python/pyarrow/table.pxi:
##########
@@ -1344,17 +1344,28 @@ cdef class ChunkedArray(_PandasConvertible):
             A capsule containing a C ArrowArrayStream struct.
         """
         cdef:
+            ChunkedArray chunked
             ArrowArrayStream* c_stream = NULL
 
         if requested_schema is not None:
-            out_type = DataType._import_from_c_capsule(requested_schema)
-            if self.type != out_type:
-                raise NotImplementedError("Casting to requested_schema")
+            target_type = DataType._import_from_c_capsule(requested_schema)
+
+            if target_type != self.type:
+                try:
+                    chunked = self.cast(target_type, safe=True)

Review Comment:
   We would ideally like to cast one chunk at a time when the stream is being read, to reduce memory consumption. Perhaps open a GH issue for it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.