[I] Make __arrow_c_stream__() not collect internally [datafusion-python]

via GitHub Mon, 03 Feb 2025 02:46:10 -0800


matko opened a new issue, #1011:
URL: https://github.com/apache/datafusion-python/issues/1011


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I'm trying to pass the result of a query into my Rust code. Some of my 
queries produce a lot of data, and I would like to process it in a streaming 
way, without first loading the entire query result into memory (where it might 
not even fit).
   
   `DataFrame` has a method `__arrow_c_stream__()`, which can be used to cross 
the FFI boundary and get query results into a native component. 
Unfortunately, it calls `.collect()` internally. This means I can't actually 
stream over the results while keeping the memory footprint low: the entire 
result set has to fit in memory, and the rest of my processing logic has to 
wait for the collect to complete before it can start.
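
   To make the difference concrete, here is a minimal pure-Python sketch of the two consumption patterns. It does not use the datafusion API at all: `produce_batches`, `sum_collected` and `sum_streamed` are hypothetical names made up for illustration, and plain lists of integers stand in for record batches.

```python
from typing import Iterator, List, Tuple


def produce_batches(num_batches: int, rows_per_batch: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for a query that yields record batches lazily."""
    for i in range(num_batches):
        start = i * rows_per_batch
        yield list(range(start, start + rows_per_batch))


def sum_collected(num_batches: int, rows_per_batch: int) -> Tuple[int, int]:
    """Eager pattern: materialize every batch before processing any of them."""
    batches = list(produce_batches(num_batches, rows_per_batch))
    # All rows are held in memory at the same time.
    peak_rows_held = sum(len(batch) for batch in batches)
    total = sum(sum(batch) for batch in batches)
    return total, peak_rows_held


def sum_streamed(num_batches: int, rows_per_batch: int) -> Tuple[int, int]:
    """Streaming pattern: process each batch as it arrives, then drop it."""
    total = 0
    peak_rows_held = 0
    for batch in produce_batches(num_batches, rows_per_batch):
        # Only the current batch is held in memory.
        peak_rows_held = max(peak_rows_held, len(batch))
        total += sum(batch)
    return total, peak_rows_held
```

   Both functions compute the same total, but the eager one holds every row at once while the streaming one never holds more than a single batch. The second pattern is what I would like to get across the FFI boundary.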
   
   **Describe the solution you'd like**
   I would like `__arrow_c_stream__()` or a similar function to produce a 
`RecordBatchReader` or even a `RecordBatchStream` (which also appears to be 
FFI-wrapped), which streams the query result without first collecting into 
memory.
   
   **Describe alternatives you've considered**
   The alternative is accepting that using results from Python in Rust will 
always require a collect on the Python side first. Given that the 
infrastructure seems to be in place to pass around readers and streams, this 
seems silly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

