westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1112696020
Thank you for the contribution. Yes, there would need to be some
significant style changes but it would be a welcome addition.
I'll try and give it a more detailed look tomorrow and play around with it.
At a glance a few thoughts:
* We don't want to use any blocking queues, they will need to be resizable.
The execution engine follows a thread-per-core model so any blocking thread is
wasted resources (and a potential deadlock). Instead, when the queue "limit"
is surpassed, we should call PauseProducing on the inputs until the queue is
sufficiently drained.
* What happens if the key column(s) have very many values? I didn't trace
all the paths but I think that means the memo store could get quite large and
become a memory pressure concern. Maybe it is only a concern for malicious
inputs and we can just reject the query as invalid. Long term we could
probably investigate just storing the row and not the batch, or, if even that
is too large, spilling the memo table to disk. I don't think any of this is
something we need to solve now, just random thoughts.
* We have some utilities for working with row-major data that we had to
come up with for the hash-join. I don't recall if these are in the current
implementation, the [proposed
implementation](https://github.com/apache/arrow/pull/12326), or both. However,
I bet we can find some overlap here to share utilities. I'll try and figure
out some suggestions after a detailed look.
* What approach are you thinking wrt sequencing as discussed on the mailing
list?
* Have you given much thought to what an as-of join looks like in Substrait
(for example, the current `keys` option won't work I don't think because it is
name based (if I understand the meaning) and not index based)
* Happy to give some thoughts on threading. My gut instinct is that
threading the join itself won't be critically important for most hardware but I
could easily be very wrong on this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]