pitrou commented on code in PR #39985:
URL: https://github.com/apache/arrow/pull/39985#discussion_r1486364432


##########
python/pyarrow/table.pxi:
##########
@@ -4932,7 +4994,13 @@ cdef class Table(_Tabular):
         -------
         PyCapsule
         """
-        return self.to_reader().__arrow_c_stream__(requested_schema)
+        cdef Table table = self
+        if requested_schema is not None:
+            out_schema = Schema._import_from_c_capsule(requested_schema)
+            if self.schema != out_schema:
+                table = self.cast(out_schema)

Review Comment:
   Not strictly necessary, but it would be nicer (both for memory consumption 
and for latency) to cast each batch when it is requested, rather than casting 
the whole table up front.
   
   You could simply use `RecordBatchReader.from_batches` with a generator that 
casts each batch in turn. Something like:
   ```cython
           batches = self.to_batches()
           out_schema = self.schema
           if requested_schema is not None:
               out_schema = Schema._import_from_c_capsule(requested_schema)
               if self.schema != out_schema:
                   batches = (batch.cast(out_schema) for batch in batches)
   
           return RecordBatchReader.from_batches(out_schema, batches).__arrow_c_stream__()
   ```
   
   (or you can fold the functionality directly into `PyRecordBatchReader`)
   



##########
python/pyarrow/table.pxi:
##########
@@ -1327,6 +1327,68 @@ cdef class ChunkedArray(_PandasConvertible):
             result += self.chunk(i).to_pylist()
         return result
 
+    def __arrow_c_stream__(self, requested_schema=None):
+        """
+        Export to a C ArrowArrayStream PyCapsule.
+
+        Parameters
+        ----------
+        requested_schema : PyCapsule, default None
+            The schema to which the stream should be cast, passed as a
+            PyCapsule containing a C ArrowSchema representation of the
+            requested schema.
+
+        Returns
+        -------
+        PyCapsule
+            A capsule containing a C ArrowArrayStream struct.
+        """
+        cdef:
+            ArrowArrayStream* c_stream = NULL
+            ChunkedArray chunked = self
+
+        if requested_schema is not None:
+            out_type = DataType._import_from_c_capsule(requested_schema)
+            if self.type != out_type:
+                chunked = self.cast(out_type)

Review Comment:
   Same remark as in `Table.__arrow_c_stream__`.



##########
python/pyarrow/tests/test_cffi.py:
##########
@@ -601,3 +601,43 @@ def test_roundtrip_batch_reader_capsule():
     assert imported_reader.read_next_batch().equals(batch)
     with pytest.raises(StopIteration):
         imported_reader.read_next_batch()
+
+
+def test_roundtrip_batch_reader_capsule_requested_schema():
+    batch = make_batch()
+    requested_schema = pa.schema([('ints', pa.list_(pa.int64()))])
+    requested_capsule = requested_schema.__arrow_c_schema__()
+    # RecordBatch has no cast() method

Review Comment:
   This should be fixed instead of working around it.
   



##########
python/pyarrow/table.pxi:
##########
@@ -4932,7 +4994,13 @@ cdef class Table(_Tabular):
         -------
         PyCapsule
         """
-        return self.to_reader().__arrow_c_stream__(requested_schema)
+        cdef Table table = self

Review Comment:
   This is probably not required.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
