ahmed-mez opened a new issue, #18773:
URL: https://github.com/apache/datafusion/issues/18773
### Is your feature request related to a problem or challenge?
For long-running aggregation queries, users want to observe intermediate
results as the query progresses rather than waiting for the final answer. This
enables:
- Progress monitoring: Show users that queries are making progress and
haven't stalled
- Early insights: Start analyzing partial results before the query completes
- Better UX: Display progressive updates in interactive applications
(dashboards, notebooks, Jupyter)
- Approximate query processing: Return "good enough" answers quickly, refine
as more data is processed
Other mature query engines provide various mechanisms for observing query
progress or returning partial results.
### Describe the solution you'd like
We'd like to explore ways to observe intermediate aggregation state for
queries with unordered input, particularly for grouped aggregations.
Some properties that seem important:
- Non-destructive: Observing intermediate state shouldn't affect final
results
- Periodic: Ability to check state at intervals (time-based or
batch-count-based)
- Efficient: Overhead should be proportional to data observed, not
re-processing
- Opt-in: Feature should be optional and not impact queries that don't use it
We've done some preliminary exploration and have ideas. We explored
callback-based peeking with a user-facing API like this:
```rust
let peek_config = IntermediatePeekConfig::new(1000, |context: PeekContext| {
// Receive intermediate results as Arrow RecordBatch
handle_intermediate_results(&context.intermediate_batch);
Ok(())
});
let agg_exec = agg_exec.with_intermediate_peek_config(Some(peek_config));
```
And extending existing traits with non-destructive peek methods:
```rust
pub trait GroupValues: Send {
// ... existing methods ...
fn peek(&self, num_groups: usize) -> Result<Vec<ArrayRef>> {
not_impl_err!("peek not supported") // Default implementation
}
}
pub trait GroupsAccumulator: Send {
// ... existing methods ...
fn peek_evaluate(&self, num_groups: usize) -> Result<ArrayRef> {
not_impl_err!("peek_evaluate not supported") // Default
implementation
}
}
```
We're very interested in hearing from the community about:
- Whether this use case resonates with others
- Potential approaches we may have missed
- How this might fit with DataFusion's architecture and future direction
- Whether there are existing extension points we could leverage
### Describe alternatives you've considered
_No response_
### Additional context
We believe this feature would benefit the broader DataFusion community
beyond our specific use case (interactive analytics, approximate query
processing, debugging). We're looking for:
- Design feedback: Which approach aligns best with DataFusion's architecture?
- API suggestions: Better ways to expose this functionality?
- Alternative approaches: Are there existing mechanisms we've overlooked?
- Collaboration: Would DataFusion maintainers/community be interested in
this feature?
- Scope guidance: Should we aim for comprehensive support or start with a
minimal viable feature?
If the community sees value in this feature and we can align on an approach,
we're willing to invest engineering effort to implement it properly. We'd
appreciate guidance from maintainers and looking forward to the discussion!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]