Hi,

Thanks for working on this!
On Wed, 10 May 2023 15:11:20 +1200 Thomas Munro <thomas.mu...@gmail.com> wrote:

> One complaint about PHJ is that it can, in rare cases, use a
> surprising amount of temporary disk space where non-parallel HJ would
> not. When it decides that it needs to double the number of batches to
> try to fit each inner batch into memory, and then again and again
> depending on your level of bad luck, it leaves behind all the earlier
> generations of inner batch files to be cleaned up at the end of the
> query. That's stupid. Here's a patch to unlink them sooner, as a
> small improvement.

This patch can indeed save a decent amount of temporary disk space.
Considering that its complexity is (currently?) quite low, it's worth it.

> The reason I didn't do this earlier is that sharedtuplestore.c
> continues the pre-existing tradition where each parallel process
> counts what it writes against its own temp_file_limit. At the time I
> thought I'd need to have one process unlink all the files, but if a
> process were to unlink files that it didn't create, that accounting
> system would break. Without some new kind of shared temp_file_limit
> mechanism that doesn't currently exist, per-process counters could go
> negative, creating free money. In the attached patch, I realised
> something that I'd missed before: there is a safe point for each
> backend to unlink just the files that it created, and there is no way
> for a process that created files not to reach that point.

Indeed. For what it's worth, from my new and admittedly inexperienced
understanding of the parallel mechanism, waiting for all workers to
reach WAIT_EVENT_HASH_GROW_BATCHES_REPARTITION, after the old batches
have been re-dispatched into the new ones, seems like a safe place to
instruct each worker to clean up its old temp files.

> Here's an example query that tries 8, 16 and then 32 batches on my
> machine, because reltuples is clobbered with a bogus value.

Nice!

Regards,
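
PS: To check my own understanding of that "safe point", here is a tiny
standalone sketch of the pattern (plain pthreads, not PostgreSQL's
barrier.c or sharedtuplestore.c, and the file names and helpers are
made up): each worker repartitions, everyone waits at a barrier, and
then each worker unlinks only the old-generation files it created
itself, so a per-process byte counter like temp_file_limit's can
always be debited by the file's owner and never goes negative.

/*
 * Sketch of barrier-synchronized cleanup: each worker unlinks only its
 * own old-generation files, and only after all workers have finished
 * repartitioning.  Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4

static pthread_barrier_t repartition_barrier;

static void
make_batch_file(int worker, int generation)
{
	char	path[64];
	FILE   *f;

	snprintf(path, sizeof(path), "batch_g%d_w%d.tmp", generation, worker);
	f = fopen(path, "w");
	if (f)
	{
		fprintf(f, "tuples for generation %d, worker %d\n",
				generation, worker);
		fclose(f);
	}
}

static void
unlink_batch_file(int worker, int generation)
{
	char	path[64];

	snprintf(path, sizeof(path), "batch_g%d_w%d.tmp", generation, worker);
	/* Only the creator unlinks, so only its own counter is debited. */
	unlink(path);
}

static void *
worker_main(void *arg)
{
	int		worker = (int) (intptr_t) arg;

	/* Generation 0: this worker's initial batch file. */
	make_batch_file(worker, 0);

	/* Repartition: write the new generation from the old one. */
	make_batch_file(worker, 1);

	/*
	 * Safe point: wait until every worker has finished repartitioning.
	 * After this, no one can still need the old generation.
	 */
	pthread_barrier_wait(&repartition_barrier);

	/* Each worker removes only the old files it created itself. */
	unlink_batch_file(worker, 0);

	return NULL;
}

int
main(void)
{
	pthread_t	threads[NWORKERS];
	int			i;

	pthread_barrier_init(&repartition_barrier, NULL, NWORKERS);
	for (i = 0; i < NWORKERS; i++)
		pthread_create(&threads[i], NULL, worker_main,
					   (void *) (intptr_t) i);
	for (i = 0; i < NWORKERS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&repartition_barrier);
	return 0;
}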