westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1112696020

   Thank you for the contribution.  Yes, there would need to be some 
significant style changes but it would be a welcome addition.
   
   I'll try and give it a more detailed look tomorrow and play around with it.  
At a glance a few thoughts:
   
    * We don't want to use any blocking queues, they will need to be resizable. 
 The execution engine follows a thread-per-core model so any blocking thread is 
wasted resources (and a potential deadlock).  Instead, when the queue "limit" 
is surpassed, we should call PauseProducing on the inputs until the queue is 
sufficiently drained.
    * What happens if the key column(s) have very many values?  I didn't trace 
all the paths but I think that means the memo store could get quite large and 
become a memory pressure concern.  Maybe it is only a concern for malicious 
inputs and we can just reject the query as invalid.  Long term we could 
probably investigate just storing the row and not the batch, or, if even that 
is too large, spilling the memo table to disk.  I don't think any of this is 
something we need to solve now, just random thoughts.
    * We have some utilities for working with row-major data that we had to 
come up with for the hash-join.  I don't recall if these are in the current 
implementation, the [proposed 
implementation](https://github.com/apache/arrow/pull/12326), or both.  However, 
I bet we can find some overlap here to share utilities.  I'll try and figure 
out some suggestions after a detailed look.
    * What approach are you thinking wrt sequencing as discussed on the mailing 
list?
    * Have you given much thought to what an as-of join looks like in Substrait 
(for example, the current `keys` option won't work I don't think because it is 
name based (if I understand the meaning) and not index based)
    * Happy to give some thoughts on threading.  My gut instinct is that 
threading the join itself won't be critically important for most hardware but I 
could easily be very wrong on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to