kosiew commented on code in PR #1222:
URL: https://github.com/apache/datafusion-python/pull/1222#discussion_r2483157734
##########
python/datafusion/dataframe.py:
##########
@@ -1131,21 +1136,54 @@ def unnest_columns(self, *columns: str, preserve_nulls: bool = True) -> DataFram
         return DataFrame(self.df.unnest_columns(columns, preserve_nulls=preserve_nulls))

     def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
-        """Export an Arrow PyCapsule Stream.
+        """Export the DataFrame as an Arrow C Stream.
+
+        The DataFrame is executed using DataFusion's streaming APIs and exposed via
+        Arrow's C Stream interface. Record batches are produced incrementally, so the
+        full result set is never materialized in memory.

-        This will execute and collect the DataFrame. We will attempt to respect the
-        requested schema, but only trivial transformations will be applied, such as
-        only returning the fields listed in the requested schema if their data types
-        match those in the DataFrame.
+        When ``requested_schema`` is provided, DataFusion applies only simple
+        projections, such as selecting a subset of existing columns or reordering
+        them. Column renaming, computed expressions, and type coercion are not
+        supported through this interface.

         Args:
-            requested_schema: Attempt to provide the DataFrame using this schema.
+            requested_schema: Either a :py:class:`pyarrow.Schema` or an Arrow C
+                Schema capsule (``PyCapsule``) produced by
+                ``schema._export_to_c_capsule()``. The DataFrame will attempt to
+                align its output with the fields and order specified by this schema.

         Returns:
-            Arrow PyCapsule object.
+            Arrow ``PyCapsule`` object representing an ``ArrowArrayStream``.
+
+        Examples:
+            >>> schema = df.schema()
+            >>> stream = df.__arrow_c_stream__(schema)
+            >>> capsule = schema._export_to_c_capsule()
+            >>> stream = df.__arrow_c_stream__(capsule)
Review Comment:
Amended
##########
docs/source/user-guide/io/arrow.rst:
##########
@@ -60,14 +60,22 @@ Exporting from DataFusion
 DataFusion DataFrames implement ``__arrow_c_stream__`` PyCapsule interface, so any
 Python library that accepts these can import a DataFusion DataFrame directly.

-.. warning::
-    It is important to note that this will cause the DataFrame execution to happen,
-    which may be a time consuming task. That is, you will cause a
-    :py:func:`datafusion.dataframe.DataFrame.collect` operation call to occur.
+.. note::
+    Invoking ``__arrow_c_stream__`` still triggers execution of the underlying
+    query, but batches are yielded incrementally rather than materialized all at
+    once in memory. Consumers can process the stream as it arrives, avoiding the
+    memory overhead of a full
+    :py:func:`datafusion.dataframe.DataFrame.collect`.
+
+    For an example of this streamed execution and its memory safety, see the
+    ``test_arrow_c_stream_large_dataset`` unit test in
+    :mod:`python.tests.test_io`.
Review Comment:
Amended
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]