On Fri, Jun 7, 2019 at 7:30 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman
> <melanieplage...@gmail.com> wrote:
> > I'm not sure I understand why you would need to compare the original
> > tuples to the unmatched tuples file.
>
> I think I was confused. Actually, I'm still not sure I understand this
> part:
>
> > Then, you iterate again through the outer side a third time to join it
> > to the unmatched tuples in the unmatched tuples file (from the first
> > chunk) and write the following to a new unmatched tuples file:
> > 5
> > 9
> > 11
> > 11
>
> and likewise here
>
> > Then you iterate a fifth time through the outer side to join it to the
> > unmatched tuples in the unmatched tuples file (from the second chunk)
> > and write the following to a new unmatched tuples file:
> > 11
> > 11
>
> So you refer to joining the outer side to the unmatched tuples file,
> but how would that tell you which outer tuples had no matches on the
> inner side? I think what you'd need to do is anti-join the unmatched
> tuples file to the current inner batch. So the algorithm would be
> something like:
>
> for each inner batch:
>     for each outer tuple:
>         if tuple matches inner batch then emit match
>         if tuple does not match inner batch and this is the first inner batch:
>             write tuple to unmatched tuples file
>     if this is not the first inner batch:
>         for each tuple from the unmatched tuples file:
>             if tuple does not match inner batch:
>                 write to new unmatched tuples file
>         discard previous unmatched tuples file and use the new one for the
>             next iteration
>
> for each tuple in the final unmatched tuples file:
>     null-extend and emit
>
> If that's not what you have in mind, maybe you could provide some
> similar pseudocode? Or you can just ignore me. I'm not trying to
> interfere with an otherwise-fruitful discussion by being the only one
> in the room who is confused...

Yep, the pseudo-code you have above is exactly what I was thinking. I
have been hacking around on my fork implementing this for the
non-parallel hashjoin (my idea was to implement a parallel-friendly
design but for the non-parallel-aware case and then go back and
implement it for the parallel-aware hashjoin later) and have some
thoughts.

I'll call the whole adaptive hashjoin fallback strategy "chunked
hashloop join" for the purposes of this description. I'll abbreviate
the three approaches we've discussed like this:

Approach A is using a separate data structure (a bitmap was the
suggested pick) to track the match status of each outer tuple.

Approach B is the inner-join + anti-join which writes unmatched tuples
out to a new file on every iteration through the outer side batch (for
each chunk of inner).

Approach C is setting a match bit in the tuple itself and then writing
all outer side tuples out on every iteration through the outer side
(for each chunk of inner).

To get started, I implemented the inner side chunking logic, which is
required for all of the approaches. I did a super basic version which
only allows nbatches to be increased during the initial hashtable
build--not during loading of subsequent batches. If a batch after
batch 0 runs out of work_mem, it just loads what will fit and saves the
inner page offset in the hashjoin state.

Part of the allure of approaches B and C for me was that they seemed to
require less code complexity and concurrency control, because you could
just write out the unmatched tuples (probably to a SharedTupleStore)
without having to care about their original order or page offset.
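To make sure we mean the same loop structure, here is a toy,
self-contained model of approach B (your pseudocode above). It is not
executor code: the join keys are plain ints, the spill files are
stand-in tmpfile()s, the inner chunks are in-memory arrays, and the
names and example values are made up, picked so the leftover files come
out like the 5/9/11/11 and 11/11 lists quoted above.

/*
 * Toy model of approach B: join each outer tuple against the current
 * in-memory chunk of the inner batch, and anti-join the running
 * unmatched-tuples file against every chunk after the first.
 * Join keys are plain ints, "spill files" are tmpfile()s of ints, and
 * an inner chunk is just an array standing in for whatever fits in
 * work_mem. None of this is real executor code.
 */
#include <stdio.h>
#include <stdbool.h>

static bool
chunk_contains(const int *chunk, int nchunk, int key)
{
    for (int i = 0; i < nchunk; i++)
        if (chunk[i] == key)
            return true;
    return false;
}

int
main(void)
{
    int     outer[] = {1, 2, 3, 5, 9, 11, 11};  /* outer batch */
    int     nouter = 7;

    /* inner batch split into two chunks that each "fit in memory" */
    int     chunk1[] = {1, 2, 3};
    int     chunk2[] = {3, 5, 9};
    int    *chunks[] = {chunk1, chunk2};
    int     nchunk[] = {3, 3};

    FILE   *unmatched = NULL;   /* unmatched-tuples file from previous chunk */

    for (int c = 0; c < 2; c++)
    {
        FILE   *new_unmatched = tmpfile();
        int     key;

        /* probe pass: always join the full outer batch to this chunk */
        for (int i = 0; i < nouter; i++)
        {
            if (chunk_contains(chunks[c], nchunk[c], outer[i]))
                printf("match: %d (chunk %d)\n", outer[i], c + 1);
            else if (c == 0)
                fwrite(&outer[i], sizeof(int), 1, new_unmatched);
        }

        /* anti-join pass: keep only tuples still unmatched after this chunk */
        if (c > 0)
        {
            rewind(unmatched);
            while (fread(&key, sizeof(int), 1, unmatched) == 1)
                if (!chunk_contains(chunks[c], nchunk[c], key))
                    fwrite(&key, sizeof(int), 1, new_unmatched);
        }

        if (unmatched)
            fclose(unmatched);  /* discard the previous unmatched-tuples file */
        unmatched = new_unmatched;
    }

    /* whatever survived every chunk gets null-extended and emitted */
    rewind(unmatched);
    {
        int     key;

        while (fread(&key, sizeof(int), 1, unmatched) == 1)
            printf("unmatched, null-extend: %d\n", key);
    }
    fclose(unmatched);
    return 0;
}

Running that emits the matches per chunk, the intermediate unmatched
file ends up holding 5, 9, 11, 11 after the first chunk, and the final
pass null-extends the two leftover 11s.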
It also seemed like it didn't require treating a spill file as though
it permits random access, nor treating the tuples in a SharedTupleStore
as ordered.

The benefit I saw of approach B over approach C was that, in the case
where more tuples are matches, it requires fewer writes than approach
C--at the cost of additional reads. It would require at most the same
number of writes as approach C.

Approach B turned out to be problematic for many reasons.

First of all, with approach B, you end up having to keep track of an
additional new spill file for unmatched outer tuples for every chunk of
the inner side. Each spill file could have a different number of
tuples, so any reuse of the file seems difficult to get right. For
approach C (which I did not try to implement), it seems like you could
get away with only maintaining two spill files for the outer side--one
to read from and one to write to. I'm sure it is more complicated than
this. However, it seemed like, for approach B, you would need to create
and destroy entirely new unmatched tuple spill files for every chunk.

Approach B was not simpler when it came to the code complexity of the
state machine either--you have to do something different for the first
chunk than for the other chunks (write to the unmatched tuples file but
read from the original spill file, whereas other chunks require writing
to a new unmatched tuples file and reading from the previous one),
which requires complexity in the state machine (and, I imagine, worker
orchestration in the parallel implementation). And you still have to
process all of the unmatched tuples, null-extend them, and emit them
before advancing the batch.

So, I decided to try out approach A. The crux of the idea (my
understanding of it, at least) is to keep a separate data structure
which has the match status of each outer tuple in the batch. The
discussion was to do this with a bitmap in a file, but I started with a
list in memory.

What I have so far is a list of structs--one for each outer tuple--
where each struct has a match flag and the page offset of that tuple in
the outer spill file. I add each struct to the list when I am getting
each tuple from the spill file in the HJ_NEED_NEW_OUTER state to join
it to the first chunk of the inner, and, since I only do this when I am
getting an outer tuple from the spill file, I also grab the page offset
and set it in the struct in the list.

While creating the list, and while processing each subsequent chunk of
the inner, if the tuple is a match, I set the match flag to true in
that outer tuple's member of the list. Then, after finishing the whole
inner batch, I loop through the list and, for each unmatched tuple, I
go to that offset in the spill file, get the tuple, and NULL-extend and
emit it. (Currently, I have a problem with the list and it doesn't
produce correct results yet.)

Thinking about how to move from my list of offsets to using a bitmap, I
got confused. Let me try to articulate what I think the bitmap
implementation would look like:

Before doing chunked hashloop join for any batch, we would need to know
how many tuples are in the outer batch in order to make the bitmap the
correct size. We could do this either with one loop through the whole
outer batch file right before joining it to the inner batch (an extra
loop), or we could try to do it during the first read of the outer
relation when processing batch 0 and keep a data structure mapping each
batch number to the number of outer tuples spilled to that batch.
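To pin down the bookkeeping I mean for approach A, here is the same
kind of toy, self-contained model: a per-outer-tuple status entry
(match flag plus spill-file offset) filled in while joining the outer
batch to the first inner chunk, updated on later chunks, and walked at
the end to emit the null-extended leftovers. Again, the struct, field,
and variable names are invented for the sketch and the spill file is
faked with a tmpfile() of ints; it is not the real hashjoin state. The
ntuples value it needs up front is the per-batch count just described
(a growable list avoids needing it; a bitmap doesn't).

/*
 * Toy model of approach A: a per-outer-tuple status entry (match flag +
 * offset into the outer batch spill file), filled while joining the
 * outer batch to the first inner chunk, updated on later chunks, and
 * walked at the end to emit the null-extended leftovers. The spill file
 * is faked with a tmpfile() of ints; every name is invented.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct OuterTupleStatus
{
    bool    matched;    /* did any inner chunk match this tuple? */
    long    offset;     /* where the tuple starts in the outer spill file */
} OuterTupleStatus;

static bool
chunk_contains(const int *chunk, int nchunk, int key)
{
    for (int i = 0; i < nchunk; i++)
        if (chunk[i] == key)
            return true;
    return false;
}

int
main(void)
{
    int     outer[] = {1, 2, 3, 5, 9, 11, 11};
    int     ntuples = 7;        /* the per-batch count a bitmap would need up front */
    FILE   *spill = tmpfile();  /* fake outer batch spill file */

    fwrite(outer, sizeof(int), ntuples, spill);

    /* inner batch split into two chunks that each "fit in memory" */
    int     chunk1[] = {1, 2, 3};
    int     chunk2[] = {3, 5, 9};
    int    *chunks[] = {chunk1, chunk2};
    int     nchunk[] = {3, 3};

    /* one entry per outer tuple; a bitmap of ntuples bits would replace this */
    OuterTupleStatus *status = calloc(ntuples, sizeof(OuterTupleStatus));

    for (int c = 0; c < 2; c++)
    {
        rewind(spill);
        for (int i = 0; i < ntuples; i++)
        {
            int     key;

            if (c == 0)
                status[i].offset = ftell(spill);    /* remember where this tuple lives */
            if (fread(&key, sizeof(int), 1, spill) != 1)
                break;          /* shouldn't happen in this toy */
            if (chunk_contains(chunks[c], nchunk[c], key))
            {
                status[i].matched = true;
                printf("match: %d (chunk %d)\n", key, c + 1);
            }
        }
    }

    /* after the last chunk, revisit only the unmatched tuples */
    for (int i = 0; i < ntuples; i++)
    {
        if (!status[i].matched)
        {
            int     key;

            fseek(spill, status[i].offset, SEEK_SET);
            if (fread(&key, sizeof(int), 1, spill) == 1)
                printf("unmatched, null-extend: %d\n", key);
        }
    }

    free(status);
    fclose(spill);
    return 0;
}

Note that in this toy the index into the status array is simply the
order in which the tuples come back off the spill file.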
Then, once we have this number, before joining the outer side to the
first chunk of the inner, we would generate a bitmap with as many bits
as there are tuples in the outer batch and save it somewhere
(eventually in a file, initially in the hjstate).

Now I am back to the original problem--how do you know which bit to set
without somehow numbering the tuples with a unique identifier? Is there
anything that uniquely identifies a spill file tuple except its offset?

--
Melanie Plageman