On 1/13/25 17:32, Melanie Plageman wrote:
> On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra <to...@vondra.me> wrote:
>>
>> I had a quiet evening yesterday, so I decided to take a stab at this
>> and see how hard would it be, and how bad would the impact be.
>> Attached is an experimental patch, doing the *bare* minimum for a
>> simple query:
>>
>> 1) It defines a limit of 128 batches (a bit low, but also 1MB). In
>> practice we'd use something like 256 - 1024, probably. Doesn't
>> matter.
>>
>> 2) Ensures the initial pass over data in MultiExecPrivateHash does
>> not use more than 128 batches, switches to "tooManyBatches=true" if
>> that happens (and dumps the batch to file ExecHashDumpBatchToFile,
>> even if it's batchno=0). And later it calls
>> ExecHashHandleTooManyBatches() to increase the nbatch further.
>>
>> 3) Does something similar for the outer relation - if there are too
>> many batches, we do ExecHashJoinRepartitionBatches() which first
>> partitions into 128 batches. This only does a single pass in the
>> WIP, though. Should be recursive or something.
>>
>> 4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer
>> so that the code can free the buffers. It also means the buffer
>> needs to be allocated separately, not embedded in BufFile struct.
>> (I'm a bit surprised it works without having to re-read the buffer
>> after freeing it, but that's probably thanks to how hashjoin uses
>> the files).
>
> I started looking at this. Even though you do explain what it does
> above, I still found it a bit hard to follow. Could you walk through
> an example (like the one you gave in SQL) but explaining what happens
> in the implementation? Basically what you have in 2 and 3 above but
> with a specific example.
>
OK, I'll try ... see the end of this message.

> This is my understanding of what this does:
> if we are at the max number of batches when building the hashtable
> and we run out of space and need to double nbatches, we
> 1. dump the data from the current batch that is in the hashtable into
> a file
> 2. close and flush all the currently open buffiles, double the number
> of batches, and then only open files for the batches we need to store
> tuples from the batch we were trying to put in the hashtable when we
> hit the limit (now in a temp file)
>

Roughly, but the second step needs to happen only after we finish the
first pass over the inner relation. I'll try to explain this as part
of the example.

> I also don't understand why ExecHashJoinRepartitionBatches() is
> needed -- I think it has something to do with needing a certain
> number of buffers open while processing batch 0, but what does this
> have to do with the outer side of the join?
>

No, this is about building batches on the outer side. We've built the
hash table, and we may have ended up with a very high nbatch. We can't
build all of those batches right away (that would need too many
buffiles), so we do it in multiple phases, to stay under the limit.

> Another random question: why doesn't ExecHashHandleTooManyBatches()
> free the outer batch files?
>

Because it was tailored for the case where all batch splits happen for
batch 0, before we even start processing the outer side. In practice
it probably should free the files.

Let's do the example - as I mentioned, I only tried doing this for the
case where all the batch increases happen for batch 0, before we start
building the outer batches. I'm 99% sure the patch will need to modify
a couple more places to handle batch increases in later stages.

Assume we don't want to use more than 128 batches, but we're running a
query that needs 256 batches. The patch will do this:

1) ExecHashTableCreate will set nbatch_maximum=128 as the limit for
the current pass over the inner relation, and it'll cap the other
nbatch fields accordingly. If we already know we'll need more batches,
we set tooManyBatches=true to remember this. But let's say we start
with nbatch=64, nbatch_maximum=128 (and thus also with
tooManyBatches=false).

2) We start loading data into the hash table, until we exceed the
memory limit for the first time. We double the number of batches to
128, move some of the data from the hash table to the new batches, and
continue.

3) We hit the memory limit again, but this time we've hit
(nbatch == nbatch_maximum), so we can't double the number of batches.
But we also can't continue adding data to the in-memory hash table, so
we set tooManyBatches=true and we start spilling even the current
batch to a file.

4) We finish the first pass over the inner relation with

   nbatch = 128
   nbatch_maximum = 128
   tooManyBatches = true

so we need to do something. We run ExecHashHandleTooManyBatches(),
which keeps increasing nbatch until the current batch fits into
work_mem. We have nbatch=128 and the query needs nbatch=256, so we
only do one loop.

Note: Right now it simply doubles the number of batches in each loop.
But it could be faster and do up to 128x in one step
(128 -> 16k -> 2M). The later batches will already do all the
increases in a single step; that needs an improvement too.

5) After ExecHashHandleTooManyBatches() completes, we have the inner
side of the join mostly "done". We have nbatch=256.

6) We start building batches on the outer side, but we also don't want
to build all the batches at once - we want to build 128 and only then
go to 256 (or further). This is what ExecHashJoinRepartitionBatches()
does. If we have too many batches for one pass, we build 128 batches
in the first pass, and then we just read those batch files, doing
further splits.

Right now this just does a single pass, splitting the relation into
128 batches, and then continues as before. That's enough for 256
batches, because 256 is a single step past 128. But it really should
be recursive / do multiple passes, to handle cases with more than 16k
batches (although with a higher limit this would be less of an issue).

7) It also frees the file buffers in various places. Chances are some
of those places are unnecessary, and it should be done in some more
places.
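To make steps 2-5 a bit more concrete, here is a rough sketch of the
inner-side logic. This is not the patch code - the helper names
ExecHashHandleSpaceExceeded and EstimateCurrentBatchSize are made up,
and the real patch wires this into MultiExecPrivateHash and the
existing nodeHash.c routines - but the control flow is the same idea:

    /*
     * Sketch only (would live in nodeHash.c): called when inserting a
     * tuple pushes us over the memory limit during the first pass.
     */
    static void
    ExecHashHandleSpaceExceeded(HashJoinTable hashtable)
    {
        if (hashtable->nbatch < hashtable->nbatch_maximum)
        {
            /* still below the per-pass limit, double nbatch as usual */
            ExecHashIncreaseNumBatches(hashtable);
        }
        else
        {
            /*
             * Can't create more batches in this pass. Remember that
             * more splits are needed, and start spilling even the
             * current batch (batch 0 here) to a file instead of
             * keeping it in memory.
             */
            hashtable->tooManyBatches = true;
            ExecHashDumpBatchToFile(hashtable, hashtable->curbatch);
        }
    }

    /*
     * Sketch only: runs once, after the first pass over the inner
     * relation finished with tooManyBatches=true.
     */
    static void
    ExecHashHandleTooManyBatches(HashJoinTable hashtable)
    {
        /*
         * Keep doubling nbatch until the data belonging to the current
         * batch fits into the memory limit. EstimateCurrentBatchSize()
         * is a placeholder - the real code has to re-read the spilled
         * batch file to find that out, and ExecHashIncreaseNumBatches()
         * (the existing doubling routine) has to be taught to work with
         * a spilled batch 0.
         */
        while (EstimateCurrentBatchSize(hashtable) > hashtable->spaceAllowed)
            ExecHashIncreaseNumBatches(hashtable);

        hashtable->tooManyBatches = false;
    }

With the example above, the loop runs exactly once (128 -> 256), after
which batch 0 fits into work_mem and the regular per-batch processing
can take over.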
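And a similarly rough sketch of the outer-side repartitioning from
step 6. Again, GetNextOuterTuple, SaveToOuterBatchFile and
MAX_BATCHES_PER_PASS are made-up names standing in for the existing
hashjoin machinery (ExecProcNode, ExecHashJoinSaveTuple, ...), and the
WIP patch only does the single 128-way pass:

    #define MAX_BATCHES_PER_PASS 128    /* the limit used in the WIP */

    /* Sketch only (would live in nodeHashjoin.c). */
    static void
    ExecHashJoinRepartitionBatches(HashJoinState *hjstate)
    {
        HashJoinTable hashtable = hjstate->hj_HashTable;
        MinimalTuple  tuple;
        uint32        hashvalue;

        /*
         * First pass: read the outer relation once and route each tuple
         * into at most 128 coarse batch files. Doubling nbatch splits
         * batch b into b and b+128, so the coarse batch is simply the
         * final batch number modulo 128.
         */
        while ((tuple = GetNextOuterTuple(hjstate, &hashvalue)) != NULL)
        {
            int     bucketno;
            int     batchno;

            ExecHashGetBucketAndBatch(hashtable, hashvalue,
                                      &bucketno, &batchno);
            SaveToOuterBatchFile(hashtable, tuple, hashvalue,
                                 batchno % MAX_BATCHES_PER_PASS);
        }

        /*
         * Further passes would re-read each coarse batch file and split
         * it again (freeing the BufFile buffers of files we're not
         * currently writing, via BufFileFreeBuffer), repeating until we
         * reach the final nbatch. The WIP stops after the single
         * 128-way pass, which is enough for nbatch=256 but not for 16k+
         * batches - that part should become recursive.
         */
    }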
As I said, I don't claim this handles all cases, especially with
splits in later batches.

Does this make it clearer?

regards

-- 
Tomas Vondra