Re: [PR] Add Arrow C streaming, DataFrame iteration, and OOM-safe streaming execution [datafusion-python]

via GitHub Sun, 14 Sep 2025 08:30:20 -0700


timsaucer commented on code in PR #1222:
URL: 
https://github.com/apache/datafusion-python/pull/1222#discussion_r2347314377



##########
python/datafusion/dataframe.py:
##########
@@ -1105,21 +1110,55 @@ def unnest_columns(self, *columns: str, preserve_nulls: 
bool = True) -> DataFram
         return DataFrame(self.df.unnest_columns(columns, 
preserve_nulls=preserve_nulls))
 
     def __arrow_c_stream__(self, requested_schema: object | None = None) -> 
object:
-        """Export an Arrow PyCapsule Stream.
+        """Export the DataFrame as an Arrow C Stream.
+
+        The DataFrame is executed using DataFusion's streaming APIs and 
exposed via
+        Arrow's C Stream interface. Record batches are produced incrementally, 
so the
+        full result set is never materialized in memory.
 
-        This will execute and collect the DataFrame. We will attempt to 
respect the
-        requested schema, but only trivial transformations will be applied 
such as only
-        returning the fields listed in the requested schema if their data 
types match
-        those in the DataFrame.
+        When ``requested_schema`` is provided, DataFusion applies only simple
+        projections such as selecting a subset of existing columns or 
reordering
+        them. Column renaming, computed expressions, or type coercion are not
+        supported through this interface.
 
         Args:
-            requested_schema: Attempt to provide the DataFrame using this 
schema.
+            requested_schema: Either a :py:class:`pyarrow.Schema` or an Arrow C
+                Schema capsule (``PyCapsule``) produced by
+                ``schema._export_to_c_capsule()``. The DataFrame will attempt 
to
+                align its output with the fields and order specified by this 
schema.
 
         Returns:
-            Arrow PyCapsule object.
+            Arrow ``PyCapsule`` object representing an ``ArrowArrayStream``.
+
+        Examples:
+            >>> schema = df.schema()
+            >>> stream = df.__arrow_c_stream__(schema)
+            >>> capsule = schema._export_to_c_capsule()
+            >>> stream = df.__arrow_c_stream__(capsule)
+
+        Notes:
+            The Arrow C Data Interface PyCapsule details are documented by 
Apache
+            Arrow and can be found at:
+            
https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
         """
+        # ``DataFrame.__arrow_c_stream__`` in the Rust extension leverages
+        # ``execute_stream_partitioned`` under the hood to stream batches while
+        # preserving the original partition order.
         return self.df.__arrow_c_stream__(requested_schema)
 
+    def __iter__(self) -> Iterator[RecordBatch]:
+        """Return an iterator over this DataFrame's record batches."""
+        return iter(self.execute_stream())
+
+    def __aiter__(self) -> AsyncIterator[RecordBatch]:
+        """Return an async iterator over this DataFrame's record batches.
+
+        `RecordBatchStream` implements ``__aiter__``, so return the stream
+        directly to remain compatible with Python < 3.10 (this project
+        supports Python >= 3.6).

Review Comment:
   I think our minimum supported version is 3.9 right now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add Arrow C streaming, DataFrame iteration, and OOM-safe streaming execution [datafusion-python]

Reply via email to