timsaucer commented on code in PR #1222:
URL: https://github.com/apache/datafusion-python/pull/1222#discussion_r2469239131
##########
python/datafusion/dataframe.py:
##########
@@ -1131,21 +1136,54 @@ def unnest_columns(self, *columns: str, preserve_nulls: bool = True) -> DataFram
         return DataFrame(self.df.unnest_columns(columns, preserve_nulls=preserve_nulls))

     def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
-        """Export an Arrow PyCapsule Stream.
+        """Export the DataFrame as an Arrow C Stream.
+
+        The DataFrame is executed using DataFusion's streaming APIs and exposed via
+        Arrow's C Stream interface. Record batches are produced incrementally, so the
+        full result set is never materialized in memory.

-        This will execute and collect the DataFrame. We will attempt to respect the
-        requested schema, but only trivial transformations will be applied such as only
-        returning the fields listed in the requested schema if their data types match
-        those in the DataFrame.
+        When ``requested_schema`` is provided, DataFusion applies only simple
+        projections such as selecting a subset of existing columns or reordering
+        them. Column renaming, computed expressions, or type coercion are not
+        supported through this interface.

         Args:
-            requested_schema: Attempt to provide the DataFrame using this schema.
+            requested_schema: Either a :py:class:`pyarrow.Schema` or an Arrow C
+                Schema capsule (``PyCapsule``) produced by
+                ``schema._export_to_c_capsule()``. The DataFrame will attempt to
+                align its output with the fields and order specified by this schema.

         Returns:
-            Arrow PyCapsule object.
+            Arrow ``PyCapsule`` object representing an ``ArrowArrayStream``.
+
+        Examples:
+            >>> schema = df.schema()
+            >>> stream = df.__arrow_c_stream__(schema)
+            >>> capsule = schema._export_to_c_capsule()
+            >>> stream = df.__arrow_c_stream__(capsule)
Review Comment:
   I think examples should be user facing. I don't think we expect users to ever interact directly with this function; instead, this function should make it seamless to share a DataFrame with other libraries that accept an Arrow C stream. I recommend removing this example and instead recommending the online documentation for usage examples.
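   For instance, a user-facing snippet might look like the sketch below. It uses a small stand-in class in place of a real DataFusion DataFrame (so it runs without DataFusion installed) and assumes PyArrow >= 14, which consumes the PyCapsule stream protocol directly in ``pa.table()``:

   ```python
   import pyarrow as pa


   class StreamingSource:
       """Stand-in for a producer (such as a DataFusion DataFrame) that
       implements the Arrow PyCapsule stream protocol."""

       def __init__(self, table: pa.Table) -> None:
           self._table = table

       def __arrow_c_stream__(self, requested_schema=None):
           # Delegate capsule export to pyarrow (requires pyarrow >= 14).
           return self._table.__arrow_c_stream__(requested_schema)


   source = StreamingSource(pa.table({"a": [1, 2, 3]}))

   # pa.table() recognizes the protocol, so users never call the dunder directly.
   result = pa.table(source)
   ```

   The point is that consumers detect ``__arrow_c_stream__`` themselves; users just hand the DataFrame over.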
##########
docs/source/user-guide/dataframe/index.rst:
##########
@@ -195,10 +195,119 @@ To materialize the results of your DataFrame operations:

     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()

+Zero-copy streaming to Arrow-based Python libraries
+---------------------------------------------------
+
+DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
+zero-copy, lazy streaming into Arrow-based Python libraries. With the streaming
+protocol, batches are produced on demand so you can process arbitrarily large
+results without out-of-memory errors.
Review Comment:
   This is true up to a point: the batches can still lead to out-of-memory problems, depending on their size and what the user is doing with them. I recommend removing the part "so you can process arbitrarily large results without out-of-memory errors." Similar to some of my other feedback, this statement is mostly oriented towards people who have already had a memory problem.
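   To illustrate: whether memory stays bounded depends on what the consumer does with each batch, not on the streaming protocol itself. A rough sketch (assuming PyArrow >= 14 for ``RecordBatchReader.from_stream``, and using a plain ``pyarrow.Table`` as the stream producer):

   ```python
   import pyarrow as pa

   table = pa.table({"v": list(range(1_000))})

   # from_stream() accepts any object implementing __arrow_c_stream__,
   # including a DataFusion DataFrame.
   reader = pa.RecordBatchReader.from_stream(table)

   total_rows = 0
   for batch in reader:
       # Process each batch and let it go out of scope; accumulating
       # the batches in a list would re-materialize the full result.
       total_rows += batch.num_rows
   ```

   So the docs can honestly say batches arrive incrementally without promising that any workload is immune to memory pressure.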
##########
docs/source/user-guide/io/arrow.rst:
##########
@@ -60,14 +60,22 @@ Exporting from DataFusion

DataFusion DataFrames implement ``__arrow_c_stream__`` PyCapsule interface, so any
Python library that accepts these can import a DataFusion DataFrame directly.

-.. warning::
-    It is important to note that this will cause the DataFrame execution to happen, which may be
-    a time consuming task. That is, you will cause a
-    :py:func:`datafusion.dataframe.DataFrame.collect` operation call to occur.
+.. note::
+    Invoking ``__arrow_c_stream__`` still triggers execution of the underlying
+    query, but batches are yielded incrementally rather than materialized all at
+    once in memory. Consumers can process the stream as it arrives, avoiding the
+    memory overhead of a full
+    :py:func:`datafusion.dataframe.DataFrame.collect`.
+
+    For an example of this streamed execution and its memory safety, see the
+    ``test_arrow_c_stream_large_dataset`` unit test in
+    :mod:`python.tests.test_io`.
Review Comment:
   This documentation is written from the perspective of a developer or user who has already encountered the issue of collection; for new users it probably is not particularly helpful. Instead, I would remove the ``note`` directive and remove the portion ", avoiding the memory overhead of a full :py:func:`datafusion.dataframe.DataFrame.collect`."
   nit: also remove the word "still" in the first line
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]