GitHub user dgildeh closed a discussion: How to get the raw Arrow stream of 
matched rows?

I'm trying to query some Parquet files on S3 and use DataFusion's partitioned
streams API to pass all the record batches to a callback function as they
become available, like below:
```
let streams = df.execute_stream_partitioned().await?;
for stream in streams {
    // Collect every record batch in this partition, then convert to JSON rows
    let batches = common::collect(stream).await?;
    let results = record_batches_to_json_rows(&batches)?;
    rows.extend(results.clone());
    if let Some(callback) = &callback {
        if !results.is_empty() {
            // Only call callback function if there are results to process
            callback(results).await;
        }
    }
}
```
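
For reference, the helpers in that snippet come from roughly these paths (my
best guess; the exact modules may differ by DataFusion/arrow version):

```
// Assumed imports for the snippet above (paths may vary by version)
use datafusion::physical_plan::common;                              // common::collect
use datafusion::arrow::json::writer::record_batches_to_json_rows;  // batches -> JSON rows
```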

However, what I'd really like to do is get a stream of rows across all threads
that match the query and pass them to the callback function as soon as they're
found (so the callback gets called with the row data every time a new row
matches, instead of waiting for the partition to finish collecting rows).
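
Conceptually, I imagine something like the rough sketch below (just my guess at
how this might look; it assumes the partition streams can be merged with
futures::stream::select_all, and it reuses df, callback, and the same JSON
conversion from my snippet above):

```
use futures::stream::{select_all, StreamExt};

// Merge all partition streams so each batch is handled as soon as any
// partition produces one, instead of collecting a whole partition first.
let streams = df.execute_stream_partitioned().await?;
let mut merged = select_all(streams);

while let Some(batch) = merged.next().await {
    let batch = batch?;
    // Convert just this batch and hand its rows to the callback right away
    let results = record_batches_to_json_rows(&[batch])?;
    if let Some(callback) = &callback {
        if !results.is_empty() {
            callback(results).await;
        }
    }
}
```

That would at least give me per-batch delivery across all partitions, but
ideally I'd get per-row delivery as matches are found.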

Is this possible, and if so, how can I do it with DataFusion? From stepping
into the execute_stream_partitioned() function, it looks like I may have to
write my own physical plan and plug it into DataFusion, but that feels like a
lot of work for a Rust/DataFusion newbie, so I'm hoping there's an easier way
or API I can hook into to do this.

Thanks!

GitHub link: https://github.com/apache/datafusion/discussions/6870
