On Fri, Apr 11, 2025 at 5:50 AM James Hunter <james.hunter...@gmail.com> wrote:
> I am looking at the pre-streaming code, in PG 17, as I am not familiar
> with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
> maintained two shared TBM iterators, for PQ. One of the iterators was
> the actual, "fetch" iterator; the other was the "prefetch" iterator,
> which kept some distance ahead of the "fetch" iterator (to hide read
> latency).
We're talking at cross-purposes. The new streaming BHS isn't just issuing probabilistic hints about future access obtained from a second iterator. It has just one shared iterator, connected up to the workers' ReadStreams. Each worker pulls a disjoint set of blocks out of its stream, possibly running a bunch of IOs in the background as required. The stream replaces the old ReadBuffer() call, and the old PrefetchBuffer() call and a bunch of dubious iterator synchronisation logic are deleted. These are now real IOs running in the background, for the *exact* blocks you will consume; posix_fadvise() was just a stepping stone towards AIO that tolerated sloppy synchronisation, including being entirely wrong. (There's a rough sketch of the per-worker arrangement at the end of this message.)

If you additionally teach the iterator to work in batches, as my 0001 patch (which I didn't propose for v18) showed, then one worker might end up processing (say) 10 blocks at end-of-scan while all the other workers have finished the node, and maybe the whole query. That'd be unfair.

"Ramp-down" (..., 8, 4, 2, 1) has been used in one or two other places in parallel-aware nodes with internal batching, as a kind of fudge to help them finish their CPU work around the same time if you're lucky, and my 0002 patch shows that NOT working here. I suspect the concept itself is defunct: it no longer narrows the CPU work completion time range across workers at all well, due to the elastic streams sitting in between.

Any naive solution that requires cooperation with, or waiting for, another worker to hand over the final scraps of work originally allocated to it is probably a deadlock risk. (I don't mean the IO completion part; that all works just fine, as you say, and a lot of engineering went into the buffer manager to make that true, for AIO but also in the preceding decades. What I mean is: how do you even know which block to read?) Essays have been written on the topic if you are interested.

All the rest of our conversation makes no sense without that context :-)

> > I admit this all sounds kinda complicated and maybe there is a much
> > simpler way to achieve the twin goals of maximising I/O combining AND
> > parallel query fairness.
>
> I tend to think that the two goals are so much in conflict, that it's
> not worth trying to apply cleverness to get them to agree on things...

I don't give up so easily :-)
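
PS In case it helps to visualise the new arrangement, here's a rough sketch of what each worker does, simplified from memory rather than lifted from nodeBitmapHeapscan.c (the real code also carries per-buffer data for the TBM results, etc.); next_shared_block() is a stand-in for the real call that advances the single shared iterator:

    /*
     * Block number callback for this worker's ReadStream: each worker
     * pulls the next block for itself from the one shared iterator.
     * next_shared_block() is a stand-in for the real shared-iterator call.
     */
    static BlockNumber
    bhs_next_block(ReadStream *stream, void *callback_private_data,
                   void *per_buffer_data)
    {
        BitmapHeapScanState *node = callback_private_data;

        /* InvalidBlockNumber ends this worker's stream */
        return next_shared_block(node);
    }

    ...

    /* Each worker sets up its own stream over the shared iterator ... */
    stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                        NULL,
                                        node->ss.ss_currentRelation,
                                        MAIN_FORKNUM,
                                        bhs_next_block,
                                        node,
                                        0);

    /*
     * ... and simply consumes buffers from it.  The stream runs real IOs
     * in the background for exactly these blocks, replacing the old
     * ReadBuffer() + PrefetchBuffer() arrangement.
     */
    while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* process the heap page in buffer, then ... */
        ReleaseBuffer(buffer);
    }
    read_stream_end(stream);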