On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra <to...@vondra.me> wrote:
>
> I had a quiet evening yesterday, so I decided to take a stab at this and
> see how hard would it be, and how bad would the impact be. Attached is
> an experimental patch, doing the *bare* minimum for a simple query:
>
> 1) It defines a limit of 128 batches (a bit low, but also 1MB). In
> practice we'd use something like 256 - 1024, probably. Doesn't matter.
>
> 2) Ensures the initial pass over data in MultiExecPrivateHash does not
> use more than 128 batches, switches to "tooManyBatches=true" if that
> happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
> it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
> increase the nbatch further.
>
> 3) Does something similar for the outer relation - if there are too many
> batches, we do ExecHashJoinRepartitionBatches() which first partitions
> into 128 batches. This only does a single pass in the WIP, though.
> Should be recursive or something.
>
> 4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
> that the code can free the buffers. It also means the buffer needs to be
> allocated separately, not embedded in BufFile struct. (I'm a bit surprised
> it works without having to re-read the buffer after freeing it, but that's
> probably thanks to how hashjoin uses the files).
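On (4): if I'm reading it right, the BufFile change boils down to something
like the sketch below. This is my own guess at the shape, assuming the
embedded buffer becomes a separately palloc'd pointer; the field handling is
what I picture, not necessarily what your patch actually does.

    /*
     * Guess at the new BufFile calls, assuming the embedded PGAlignedBlock
     * is replaced by a separately palloc'd "char *buffer".  Not the actual
     * patch, just how I picture it working inside buffile.c.
     */
    bool
    BufFileHasBuffer(BufFile *file)
    {
        return file->buffer != NULL;
    }

    void
    BufFileFreeBuffer(BufFile *file)
    {
        /* write out any buffered dirty data before dropping the buffer */
        if (file->dirty)
            BufFileDumpBuffer(file);    /* existing static helper */

        if (file->buffer != NULL)
        {
            pfree(file->buffer);
            file->buffer = NULL;
            file->pos = 0;
            file->nbytes = 0;
        }

        /* presumably BufFileLoadBuffer/BufFileDumpBuffer re-palloc the
         * buffer on demand the next time the file is read or written */
    }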
I started looking at this. Even though you do explain what it does above, I
still found it a bit hard to follow. Could you walk through an example (like
the one you gave in SQL) but explaining what happens in the implementation?
Basically what you have in 2 and 3 above, but with a specific example.

This is my understanding of what this does: if we are at the max number of
batches when building the hashtable and we run out of space and need to
double nbatches, we

1. dump the data from the current batch that is in the hashtable into a file

2. close and flush all the currently open buffiles, double the number of
batches, and then only open files for the batches we need to store tuples
from the batch we were trying to put in the hashtable when we hit the limit
(now in a temp file)

(I've also written this flow out as a quick code sketch at the end of this
mail.)

I also don't understand why ExecHashJoinRepartitionBatches() is needed -- I
think it has something to do with needing a certain number of buffers open
while processing batch 0, but what does this have to do with the outer side
of the join?

Another random question: why doesn't ExecHashHandleTooManyBatches() free the
outer batch files?
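Here's the sketch of the flow in 1 and 2 above. It is not your patch:
ExecHashDumpBatchToFile and BufFileFreeBuffer are names from your mail,
ReadNextSpilledTuple and SaveTupleToBatch are placeholders I made up, and
the batch-file array growth and memory accounting are left out.

    /*
     * Rough sketch of my understanding of the rebatching step when we hit
     * the batch limit while building the hashtable.  Placeholder helpers
     * are marked; this is to check my mental model, not a proposal.
     */
    static void
    HandleBatchLimitDuringBuild(HashJoinTable hashtable)
    {
        MinimalTuple tuple;
        uint32      hashvalue;
        int         bucketno;
        int         batchno;

        /* 1. dump the in-memory tuples of the current batch (batch 0
         * during the initial build) to a temp file */
        ExecHashDumpBatchToFile(hashtable, hashtable->curbatch);

        /* 2. flush the currently open inner batch files and release
         * their buffers so they stop counting against the memory budget;
         * the underlying temp files stay around */
        for (int i = 0; i < hashtable->nbatch; i++)
        {
            if (hashtable->innerBatchFile[i] != NULL)
                BufFileFreeBuffer(hashtable->innerBatchFile[i]);
        }

        /* double nbatch (growing the batch-file arrays is elided here) */
        hashtable->nbatch *= 2;

        /* re-read the dumped batch and route each tuple to its new batch,
         * opening a buffered file only for the batches that actually
         * receive tuples from this one batch */
        while ((tuple = ReadNextSpilledTuple(hashtable, &hashvalue)) != NULL)
        {
            /* ReadNextSpilledTuple/SaveTupleToBatch are placeholders */
            ExecHashGetBucketAndBatch(hashtable, hashvalue,
                                      &bucketno, &batchno);
            SaveTupleToBatch(hashtable, batchno, tuple, hashvalue);
        }
    }

- Melanie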