kylebarron commented on code in PR #1222:
URL: https://github.com/apache/datafusion-python/pull/1222#discussion_r2319619566

##########
docs/source/user-guide/dataframe/index.rst:
##########
@@ -145,10 +145,39 @@ To materialize the results of your DataFrame operations:
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()

+PyArrow Streaming
+-----------------
+
+DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
+zero-copy streaming into libraries like `PyArrow <https://arrow.apache.org/>`_.
+Earlier versions eagerly converted the entire DataFrame when exporting to
+PyArrow, which could exhaust memory on large datasets. With streaming, batches
+are produced lazily so you can process arbitrarily large results without
+out-of-memory errors.
+
+.. code-block:: python
+
+    import pyarrow as pa
+
+    # Create a PyArrow RecordBatchReader without materializing all batches
+    reader = pa.RecordBatchReader._import_from_c_capsule(df.__arrow_c_stream__())
+    for batch in reader:
+        ...  # process each batch as it is produced
+
+DataFrames are also iterable, yielding :class:`pyarrow.RecordBatch` objects
+lazily so you can loop over results directly:
+
+.. code-block:: python
+
+    for batch in df:
+        ...  # process each batch as it is produced

Review Comment:
   We already have our own `RecordBatch` class:
   https://datafusion.apache.org/python/autoapi/datafusion/record_batch/index.html#datafusion.record_batch.RecordBatch

   Also, we should ensure that the dunder methods are rendered in the docs.
   It doesn't look like they are currently. (Or maybe the dunder methods on
   that `RecordBatch` aren't documented?)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org