icexelloss commented on PR #13028: URL: https://github.com/apache/arrow/pull/13028#issuecomment-1113351284
Thanks @westonpace > We don't want to use any blocking queues, they will need to be resizable. The execution engine follows a thread-per-core model so any blocking thread is wasted resources (and a potential deadlock). Instead, when the queue "limit" is surpassed, we should call PauseProducing on the inputs until the queue is sufficiently drained. I see. I will try to change this. > What happens if the key column(s) have very many values? I didn't trace all the paths but I think that means the memo store could get quite large and become a memory pressure concern. Maybe it is only a concern for malicious inputs and we can just reject the query as invalid. Long term we could probably investigate just storing the row and not the batch, or, if even that is too large, spilling the memo table to disk. I don't think any of this is something we need to solve now, just random thoughts. Hmm..not sure if I follow what u mean... Memo store just stores row/index into the batch (it only keeps the latest row for each batch for each key). Even if there are many keys they might just point to the same batch, so it should be a pretty light weight data structure (a couple of pointers per key). Maybe you are concerning many key values pointing to different batches? >We have some utilities for working with row-major data that we had to come up with for the hash-join. I don't recall if these are in the current implementation, the https://github.com/apache/arrow/pull/12326, or both. However, I bet we can find some overlap here to share utilities. I'll try and figure out some suggestions after a detailed look. Cool thanks. I was sure if I should reuse `RowEncoder` or not. > What approach are you thinking wrt sequencing as discussed on the mailing list? I haven't put too much thought into it and want to leave it out of the scope of this PR. (Seems like adding foundation logic for ordered data could be a separate PR). At the high level, I think upstream provides batch index and downstream node reorder them and split out to desk is a reasonable approach, just not sure if there is other solution that is simpler. > Have you given much thought to what an as-of join looks like in Substrait (for example, the current keys option won't work I don't think because it is name based (if I understand the meaning) and not index based) Yes - @rtpsw did some POC on Substrait + Asof Join and managed to get it to work. We did change to index based from name based in the substrait plan IIRC. > Happy to give some thoughts on threading. My gut instinct is that threading the join itself won't be critically important for most hardware but I could easily be very wrong on this. Cool thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
