alamb commented on issue #8130:
URL:
https://github.com/apache/arrow-datafusion/issues/8130#issuecomment-1809237585
I started hacking on this locally -- the basic structure I have in mind
looks something like this (I need to sort out the `OnceFut` handling, which is
awkward at the moment).
But the idea is to wrap up the state needed to build output into a separate
enum like the following:
```
/// State machine for creating output for HashJoin
///
/// TODO Add memory reservation for intermediate rows
enum HashJoinOutput {
    /// output phase has not yet started, input is still being read
    ReadingInput {
        /// future which builds hash table from left side
        left_fut: OnceFut<JoinLeftData>,
    },
    /// output phase has started, but have no probe batch
    Ready {
        // TODO make this into the proper state
        left_fut: OnceFut<JoinLeftData>,
    },
    /// output is being built from probe batches
    Probing {
        data: JoinLeftData,
    },
    /// emitting any final unmatched indices, if any (depending on the
    /// join type)
    Unmatched {
        data: JoinLeftData,
    },
    /// Input is complete, and output is complete
    Done,
}
```
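As a rough illustration of how such an enum might move between phases, here is a minimal sketch. It assumes a simplified enum without the `OnceFut` states; `LeftData`, `OutputState`, and `advance` are hypothetical names, not DataFusion APIs. The point is only that `std::mem::replace` lets the owned build-side data move from one variant to the next without cloning.

```rust
use std::mem;

/// Hypothetical stand-in for the real `JoinLeftData` (the build-side data).
#[derive(Debug, PartialEq)]
struct LeftData {
    rows: Vec<u32>,
}

/// Simplified, hypothetical version of the state enum: the states that hold
/// a `OnceFut` are left out to keep the sketch self-contained.
#[derive(Debug, PartialEq)]
enum OutputState {
    Probing { data: LeftData },
    Unmatched { data: LeftData },
    Done,
}

impl OutputState {
    /// Advance to the next phase once the current one is exhausted.
    /// `mem::replace` temporarily parks `Done` in `self` so the owned
    /// `data` can be moved into the next variant without a clone.
    fn advance(&mut self) {
        *self = match mem::replace(self, OutputState::Done) {
            OutputState::Probing { data } => OutputState::Unmatched { data },
            OutputState::Unmatched { .. } | OutputState::Done => OutputState::Done,
        };
    }
}
```

In this sketch, calling `advance` on `Probing` yields `Unmatched` holding the same data, and any further call ends in `Done`.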
Then I think adding the logic to incrementally compute the matching indices
is more tractable (though as you say @korowa we'll still have to protect
against pathological cases where each input row in the probe batch matches all
the rows in the hash table).
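One way to picture the protection against that pathological case is to cap the number of index pairs produced per output batch and remember where to resume. The sketch below is only an illustration under stated assumptions: `ProbeCursor` and `next_chunk` are hypothetical names, and the precomputed `matches` list stands in for walking the real hash table.

```rust
/// Hypothetical cursor remembering where the previous output batch stopped.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct ProbeCursor {
    /// index of the probe row currently being expanded
    probe_row: usize,
    /// how many of that row's matches have already been emitted
    match_offset: usize,
}

/// Emit at most `limit` (build_idx, probe_idx) pairs, resuming from `cursor`.
/// `matches[p]` lists the build-side rows matching probe row `p`; this is an
/// assumption of the sketch, a real join would walk the hash table instead.
fn next_chunk(
    matches: &[Vec<usize>],
    cursor: &mut ProbeCursor,
    limit: usize,
) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    while cursor.probe_row < matches.len() && out.len() < limit {
        let p = cursor.probe_row;
        let row = &matches[p];
        let remaining = &row[cursor.match_offset..];
        // Take only as many matches as still fit in this output chunk.
        let take = remaining.len().min(limit - out.len());
        out.extend(remaining[..take].iter().map(|&b| (b, p)));
        cursor.match_offset += take;
        // Move on to the next probe row once this one is fully emitted.
        if cursor.match_offset == row.len() {
            cursor.probe_row += 1;
            cursor.match_offset = 0;
        }
    }
    out
}
```

With this shape, a probe batch whose rows each match every build row still produces output in chunks of at most `limit` pairs, and the cursor carries the position across calls.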
I think this will take me a few days to code up realistically, and
https://github.com/apache/arrow-datafusion/issues/8078 is higher priority for
me. However, I think we'll be able to make this work.