[GitHub] [arrow] icexelloss commented on pull request #13028: ARROW-16083]: [WIP][C++] Implement AsofJoin execution node

GitBox Fri, 29 Apr 2022 07:03:53 -0700


icexelloss commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1113351284


   Thanks @westonpace 
   
   > We don't want to use any blocking queues, they will need to be resizable. 
The execution engine follows a thread-per-core model so any blocking thread is 
wasted resources (and a potential deadlock). Instead, when the queue "limit" is 
surpassed, we should call PauseProducing on the inputs until the queue is 
sufficiently drained.
   
   I see. I will try to change this.
   
   > What happens if the key column(s) have very many values? I didn't trace 
all the paths but I think that means the memo store could get quite large and 
become a memory pressure concern. Maybe it is only a concern for malicious 
inputs and we can just reject the query as invalid. Long term we could probably 
investigate just storing the row and not the batch, or, if even that is too 
large, spilling the memo table to disk. I don't think any of this is something 
we need to solve now, just random thoughts.
   
   Hmm..not sure if I follow what u mean... Memo store just stores row/index 
into the batch (it only keeps the latest row for each batch for each key). Even 
if there are many keys they might just point to the same batch, so it should be 
a pretty light weight data structure (a couple of pointers per key). Maybe you 
are concerning many key values pointing to different batches?
   
   >We have some utilities for working with row-major data that we had to come 
up with for the hash-join. I don't recall if these are in the current 
implementation, the https://github.com/apache/arrow/pull/12326, or both. 
However, I bet we can find some overlap here to share utilities. I'll try and 
figure out some suggestions after a detailed look.
   
   Cool thanks. I was sure if I should reuse `RowEncoder` or not.
   
   > What approach are you thinking wrt sequencing as discussed on the mailing 
list?
   
   I haven't put too much thought into it and want to leave it out of the scope 
of this PR. (Seems like adding foundation logic for ordered data could be a 
separate PR).  At the high level, I think upstream provides batch index and 
downstream node reorder them and split out to desk is a reasonable approach, 
just not sure if there is other solution that is simpler.
   
   > Have you given much thought to what an as-of join looks like in Substrait 
(for example, the current keys option won't work I don't think because it is 
name based (if I understand the meaning) and not index based)
   
   Yes - @rtpsw did some POC on Substrait + Asof Join and managed to get it to 
work. We did change to index based from name based in the substrait plan IIRC.
   
   > Happy to give some thoughts on threading. My gut instinct is that 
threading the join itself won't be critically important for most hardware but I 
could easily be very wrong on this.
   
   Cool thanks. 
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] icexelloss commented on pull request #13028: ARROW-16083]: [WIP][C++] Implement AsofJoin execution node

Reply via email to