GitHub user dgildeh closed a discussion: How to get the raw Arrow stream of
matched rows?
I'm trying to query some Parquet files on S3 and use DataFusion's partitioned
streams API to pass the record batches to a callback function as they become
available, like below:
```
use arrow::json::writer::record_batches_to_json_rows;
use datafusion::physical_plan::common;

let streams = df.execute_stream_partitioned().await?;
for stream in streams {
    // Collect every record batch this partition produces before converting
    let batches = common::collect(stream).await?;
    let results = record_batches_to_json_rows(&batches)?;
    rows.append(&mut results.clone());
    if let Some(callback) = &callback {
        // Only call the callback function if there are results to process
        if !results.is_empty() {
            callback(results).await;
        }
    }
}
```
However, what I'd really like to do is get a stream of rows across all threads
that match the query, and pass them to the callback function as soon as they're
found (so the callback gets called with the row data every time a new row
matches, instead of waiting for a partition to finish collecting its rows).
Is this possible, and if so, how can I do it with DataFusion? From stepping
into the execute_stream_partitioned() function it looks like I may have to write
my own physical plan and plug it into DataFusion, but that feels like a lot of
work for a Rust/DataFusion newbie, so I'm hoping there's an easier way or API I
can hook into.
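For reference, this sketch is roughly the shape of what I'm hoping for: merge
the partition streams with futures::stream::select_all (my assumption, not
something I've seen in the DataFusion docs) and convert each batch as it
arrives, instead of collecting a whole partition first:
```
use arrow::json::writer::record_batches_to_json_rows;
use futures::stream::{select_all, StreamExt};

let streams = df.execute_stream_partitioned().await?;
// Merge the per-partition streams so a batch is handled as soon as any
// partition yields one
let mut merged = select_all(streams);
while let Some(batch) = merged.next().await {
    let batch = batch?;
    // Newer arrow-json versions expect &[&RecordBatch] here instead of &[RecordBatch]
    let results = record_batches_to_json_rows(&[batch])?;
    if !results.is_empty() {
        if let Some(callback) = &callback {
            callback(results).await;
        }
    }
}
```
Even that would still hand the callback a whole batch at a time rather than
individual rows, but getting batches as soon as they're produced (instead of
after a partition has been fully collected) is really what I'm after.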
Thanks!
GitHub link: https://github.com/apache/datafusion/discussions/6870