westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1113910387
Ok, managed to look at it a bit more today.
Your sidecar processing thread is probably fine for a first approach.
Eventually we will probably want to get rid of it with something that looks
like:
```
InputReceived(...) {
  // Called for every input
  StoreBatch();
  // Only called sometimes
  if (HaveEnoughToProcess()) {
    Process();
  }
}
```
The main advantage of the above approach is just to avoid scheduling
conflicts / context switches that we would have by introducing another busy
thread.
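As a rough illustration of that pattern (with a made-up `Batch` alias, threshold, and `Process` body standing in for the real exec node machinery), it might look like:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Placeholder for an Arrow record batch; any batch type works for the sketch.
using Batch = std::vector<int>;

class AsofJoinNode {
 public:
  explicit AsofJoinNode(std::size_t threshold) : threshold_(threshold) {}

  // Called for every input batch, on the caller's thread -- no sidecar
  // processing thread, so no extra scheduling conflicts or context switches.
  void InputReceived(Batch batch) {
    StoreBatch(std::move(batch));
    // Only process once enough input has accumulated.
    if (HaveEnoughToProcess()) {
      Process();
    }
  }

  std::size_t processed_count() const { return processed_; }

 private:
  void StoreBatch(Batch batch) { stored_.push_back(std::move(batch)); }

  bool HaveEnoughToProcess() const { return stored_.size() >= threshold_; }

  void Process() {
    // Real processing would join the stored batches; here we just drain them.
    processed_ += stored_.size();
    stored_.clear();
  }

  std::size_t threshold_;
  std::size_t processed_ = 0;
  std::vector<Batch> stored_;
};
```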
Yes, I misunderstood and my concern about running out of memory was not
valid.
The `MemoStore` approach seems sound but I bet we could come up with
something more efficient later. Just thinking out loud I would think something
like...
* When a batch arrives, immediately compute a column of hashes based on the key columns (this happens as part of `InputReceived`, before any storage or processing; doing it in `InputReceived` will be important when we get to micro batches because you will want to compute the hashes while the data is still in the CPU cache). Store the column of hashes with the batch.
* In processing, take your LHS time column and for each RHS table do something like...
```
MaterializeAsOf(reference_times, rhs_hashes, rhs_times, rhs_columns) {
  vector<row_idx_t> rhs_row_indices = Magic(reference_times, rhs_hashes, rhs_times);
  materialized_columns = []
  for (rhs_column in rhs_columns) {
    materialized_columns.append(Take(rhs_column, rhs_row_indices));
  }
  return materialized_columns;
}
```
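For what it's worth, a minimal sketch of that per-row hash step (plain `std::hash` with a boost-style combiner over integer key columns — not Arrow's actual hash kernels) could look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Compute one hash per row from the key columns, so later lookups can
// compare hashes instead of re-reading the key values. The combiner below
// is a simple boost-style hash_combine, not Arrow's real hashing.
std::vector<std::size_t> HashKeyColumns(
    const std::vector<std::vector<int64_t>>& key_columns) {
  if (key_columns.empty()) return {};
  std::size_t num_rows = key_columns[0].size();
  std::vector<std::size_t> hashes(num_rows, 0);
  for (const auto& column : key_columns) {
    for (std::size_t row = 0; row < num_rows; ++row) {
      std::size_t h = std::hash<int64_t>{}(column[row]);
      hashes[row] ^=
          h + 0x9e3779b97f4a7c15ULL + (hashes[row] << 6) + (hashes[row] >> 2);
    }
  }
  return hashes;
}
```

Rows with equal key values get equal hashes, so the hash column can stand in for the keys during the as-of lookup.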
I'd have to think a bit more about how `Magic` works (and I think it would require an `unordered_map`), but if you didn't have a key column it would look something like...
```
Magic([1, 5, 10], [2, 4, 6, 8, 10]) -> [-1, 1, 4]
// i.e., the index of the latest rhs_time for each reference_time
```
Then `Take` would just be a normal `Take`. But maybe you also have something like...
```
Magic([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 6]) -> [-1, 0, 0, 1, 1, 2, 2, 2]
```
I guess `Take` is just a normal `Take` there too. That would save you from
having to write the materialize logic yourself. This also keeps things
"column-based" which will allow for better utilization of SIMD. Others will
probably have much smarter opinions too :smile: so we can always play with
performance later.
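Assuming both time columns are sorted ascending, a minimal single-cursor sketch of the keyless `Magic` (plus a plain `Take` that emits a sentinel where a real kernel would emit null) might look like the following; the keyed variant would keep one such cursor per key hash in the `unordered_map` mentioned above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using row_idx_t = int64_t;  // -1 means "no RHS row at or before this time"

// For each reference time, return the index of the latest rhs_time that is
// <= the reference time, or -1 if none. Both inputs are assumed sorted
// ascending, so one forward merge pass over both columns suffices.
std::vector<row_idx_t> Magic(const std::vector<int64_t>& reference_times,
                             const std::vector<int64_t>& rhs_times) {
  std::vector<row_idx_t> indices;
  indices.reserve(reference_times.size());
  row_idx_t cursor = -1;
  for (int64_t ref : reference_times) {
    // Advance the cursor while the next RHS time is still <= ref.
    while (cursor + 1 < static_cast<row_idx_t>(rhs_times.size()) &&
           rhs_times[cursor + 1] <= ref) {
      ++cursor;
    }
    indices.push_back(cursor);
  }
  return indices;
}

// A plain Take: gather rows by index, with a sentinel for -1 (a real
// implementation would produce a null instead).
std::vector<int64_t> Take(const std::vector<int64_t>& column,
                          const std::vector<row_idx_t>& indices,
                          int64_t null_sentinel = -1) {
  std::vector<int64_t> out;
  out.reserve(indices.size());
  for (row_idx_t idx : indices) {
    out.push_back(idx < 0 ? null_sentinel : column[idx]);
  }
  return out;
}
```

The merge pass is O(n + m) over the two sorted columns, and `Take` itself stays a straight columnar gather, which is what keeps the approach SIMD-friendly.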
I could be way off base here too. I'll leave it to you on how you want to proceed next. If you want to do more optimization and benchmarking, you can. If you want to get something working first, then I think we can focus on getting this cleaned up and migrated over to the Arrow style (see
https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
). We will also want some more tests. I think my personal preference would
be to get something in that is clean with a lot of tests first. Then get some
benchmarks in place. Then we can play around with performance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]