westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1113910387
Ok, managed to look at it a bit more today.
Your sidecar processing thread is probably fine for a first approach.
Eventually we will probably want to get rid of it with something that looks
like:
```
InputReceived(...) {
  // Called for every input
  StoreBatch();
  // Only called sometimes
  if (HaveEnoughToProcess()) {
    Process();
  }
}
```
The main advantage of the above approach is just to avoid scheduling
conflicts / context switches that we would have by introducing another busy
thread.
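As a rough illustration of that pattern (with a made-up `Batch` alias, threshold, and `Process` body standing in for the real exec node machinery), it might look like:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Placeholder for an Arrow record batch; any batch type works for the sketch.
using Batch = std::vector<int>;

class AsofJoinNode {
 public:
  explicit AsofJoinNode(std::size_t threshold) : threshold_(threshold) {}

  // Called for every input batch, on the caller's thread -- no sidecar
  // processing thread, so no extra scheduling conflicts or context switches.
  void InputReceived(Batch batch) {
    StoreBatch(std::move(batch));
    // Only process once enough input has accumulated.
    if (HaveEnoughToProcess()) {
      Process();
    }
  }

  std::size_t processed_count() const { return processed_; }

 private:
  void StoreBatch(Batch batch) { stored_.push_back(std::move(batch)); }

  bool HaveEnoughToProcess() const { return stored_.size() >= threshold_; }

  void Process() {
    // Real processing would join the stored batches; here we just drain them.
    processed_ += stored_.size();
    stored_.clear();
  }

  std::size_t threshold_;
  std::size_t processed_ = 0;
  std::vector<Batch> stored_;
};
```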
Yes, I misunderstood and my concern about running out of memory was not
valid.
The `MemoStore` approach seems sound but I bet we could come up with
something more efficient later. Just thinking out loud I would think something
like...
* When a batch arrives, immediately compute a column of hashes based on the key columns (this happens as part of `InputReceived`, before any storage or processing; doing it in `InputReceived` will be important when we get to micro batches because you will want to compute the hashes while the data is still in the CPU cache). Store the column of hashes with the batch.
* In processing, take your LHS time column and for each RHS table do something like...
```
MaterializeAsOf(reference_times, rhs_hashes, rhs_times, rhs_columns) {
  vector<row_idx_t> rhs_row_indices = Magic(reference_times, rhs_hashes, rhs_times);
  materialized_columns = []
  for (rhs_column in rhs_columns) {
    materialized_columns.append(Take(rhs_column, rhs_row_indices));
  }
  return materialized_columns;
}
```
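For what it's worth, a minimal sketch of that per-row hash step (plain `std::hash` with a boost-style combiner over integer key columns — not Arrow's actual hash kernels) could look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Compute one hash per row from the key columns, so later lookups can
// compare hashes instead of re-reading the key values. The combiner below
// is a simple boost-style hash_combine, not Arrow's real hashing.
std::vector<std::size_t> HashKeyColumns(
    const std::vector<std::vector<int64_t>>& key_columns) {
  if (key_columns.empty()) return {};
  std::size_t num_rows = key_columns[0].size();
  std::vector<std::size_t> hashes(num_rows, 0);
  for (const auto& column : key_columns) {
    for (std::size_t row = 0; row < num_rows; ++row) {
      std::size_t h = std::hash<int64_t>{}(column[row]);
      hashes[row] ^=
          h + 0x9e3779b97f4a7c15ULL + (hashes[row] << 6) + (hashes[row] >> 2);
    }
  }
  return hashes;
}
```

Rows with equal key values get equal hashes, so the hash column can stand in for the keys during the as-of lookup.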
I'd have to think a bit more about how `Magic` works (and I think it would require an `unordered_map`), but if you didn't have a key column it would look something like...
```
Magic([1, 5, 10], [2, 4, 6, 8, 10]) -> [-1, 1, 4]
// i.e., the index of the latest rhs_time for each reference_time
```
Then `Take` would just be a normal `Take`. But maybe you also have something like...
```
Magic([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 6]) -> [-1, 0, 0, 1, 1, 2, 2, 2]
```
I guess `Take` is just a normal `Take` there too. That would save you from
having to write the materialize logic yourself. This also keeps things
"column-based" which will allow for better utilization of SIMD. Others will
probably have much smarter opinions too :smile: so we can always play with
performance later.
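Assuming both time columns are sorted ascending, a minimal single-cursor sketch of the keyless `Magic` (plus a plain `Take` that emits a sentinel where a real kernel would emit null) might look like the following; the keyed variant would keep one such cursor per key hash in the `unordered_map` mentioned above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using row_idx_t = int64_t;  // -1 means "no RHS row at or before this time"

// For each reference time, return the index of the latest rhs_time that is
// <= the reference time, or -1 if none. Both inputs are assumed sorted
// ascending, so one forward merge pass over both columns suffices.
std::vector<row_idx_t> Magic(const std::vector<int64_t>& reference_times,
                             const std::vector<int64_t>& rhs_times) {
  std::vector<row_idx_t> indices;
  indices.reserve(reference_times.size());
  row_idx_t cursor = -1;
  for (int64_t ref : reference_times) {
    // Advance the cursor while the next RHS time is still <= ref.
    while (cursor + 1 < static_cast<row_idx_t>(rhs_times.size()) &&
           rhs_times[cursor + 1] <= ref) {
      ++cursor;
    }
    indices.push_back(cursor);
  }
  return indices;
}

// A plain Take: gather rows by index, with a sentinel for -1 (a real
// implementation would produce a null instead).
std::vector<int64_t> Take(const std::vector<int64_t>& column,
                          const std::vector<row_idx_t>& indices,
                          int64_t null_sentinel = -1) {
  std::vector<int64_t> out;
  out.reserve(indices.size());
  for (row_idx_t idx : indices) {
    out.push_back(idx < 0 ? null_sentinel : column[idx]);
  }
  return out;
}
```

The merge pass is O(n + m) over the two sorted columns, and `Take` itself stays a straight columnar gather, which is what keeps the approach SIMD-friendly.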
I could be way off base here too. I'll leave it to you on how you want to proceed next. If you want to do more optimization and benchmarking, you can. If you want to get something working first, then I think we can focus on getting this cleaned up and migrated over to the Arrow style (see
https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
). We will also want some more tests. I think my personal preference would
be to get something in that is clean with a lot of tests first. Then get some
benchmarks in place. Then we can play around with performance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]