suibianwanwank commented on issue #15406:
URL: https://github.com/apache/datafusion/issues/15406#issuecomment-2762027162
Hi, @2010YOUY01. I have read and tried to understand both
SortMergeJoinStream and GroupedHashAggregateStream (though I still have some
uncertainties). I have some initial thoughts on organizing the structure, but
I’m not sure about the right level of granularity.
One idea is to introduce a SortMergeState, which would roughly comprise:
```rust
struct SortMergeState {
/// State of streamed
pub streamed_state: StreamedState,
/// State of buffered
pub buffered_state: BufferedState,
/// Staging output size, including output batches and staging joined
results.
/// Increased when we put rows into buffer and decreased after we
actually output batches.
/// Used to trigger output when sufficient rows are ready
pub output_size: usize,
/// Target output batch size
pub batch_size: usize,
/// Current processing record batch of streamed
pub streamed_batch: StreamedBatch,
/// Current buffered data
pub buffered_data: BufferedData,
/// (used in outer join) Is current streamed row joined at least once?
pub streamed_joined: bool,
/// (used in outer join) Is current buffered batches joined at least
once?
pub buffered_joined: bool,
/// A unique number for each batch
pub streamed_batch_counter: AtomicUsize,
/// Staging output array builders
pub staging_output_record_batches: JoinedRecordBatches,
/// Output buffer. Currently used by filtering as it requires double
buffering
/// to avoid small/empty batches. Non-filtered join outputs directly
from `staging_output_record_batches.batches`
pub output: RecordBatch,
}
```
These contain the parts of the stream, buffer, and output that change during
the merge process.
Another idea is to organize the contents of the buffer, stream and output
separately, which would roughly comprise:
```rust
struct StreamContext {
/// State of streamed
pub streamed_state: StreamedState,
/// Current processing record batch of streamed
pub streamed_batch: StreamedBatch,
/// (used in outer join) Is current streamed row joined at least once?
pub streamed_joined: bool,
/// Join key columns of streamed
pub on_streamed: Vec<PhysicalExprRef>,
/// Streamed data stream
pub streamed: SendableRecordBatchStream,
/// Input schema of streamed
pub streamed_schema: SchemaRef,
}
struct BufferContext {
//...
}
struct OutputState {
//...
}
```
Do you have any suggestions on which approach would be clearer and more
readable? :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]