On 2/17/26 21:16, Peter Geoghegan wrote:
> On Tue, Feb 17, 2026 at 2:27 PM Andres Freund <[email protected]> wrote:
>> On 2026-02-17 12:16:23 -0500, Peter Geoghegan wrote:
>>> On Mon, Feb 16, 2026 at 11:48 AM Andres Freund <[email protected]> wrote:
>>> I agree that the current heuristics (which were invented recently) are
>>> too conservative. I overfit the heuristics to my current set of
>>> adversarial queries, as a stopgap measure.
>>
>> Are you doing any testing on higher latency storage? I found it to be quite
>> valuable to use dm_delay to have a disk with reproducible (i.e. not cloud)
>> higher latency (i.e. not just a local SSD).
>
> I sometimes use dm_delay (with the minimum 1ms delay) when testing,
> but don't do so regularly. Just because it's inconvenient to do so
> (perhaps not a great reason).
>
>> Low latency NVMe can reduce the
>> penalty of not enough readahead so much that it's hard to spot problems...
>
> I'll keep that in mind.
>
So, what counts as "higher latency" in this context? What delays should
we consider practical/relevant for testing?
>>> ISTM that we need the yields to better cooperate with whatever's
>>> happening on the read stream side.
>>
>> Plausible. It could be that we could get away with controlling the rampup to
>> be slower in potentially problematic cases, without needing the yielding, but
>> not sure.
>>
>> If that doesn't work, it might just be sufficient to increase the number of
>> batches that trigger yields as the scan goes on (perhaps by taking the number
>> of already "consumed" batches into account).
>
> It could make sense to take the number of consumed batches into
> account. In general, I think the best approach will be one that
> combines multiple complementary strategies.
>
Yes, this is roughly what I meant by "ramp up". Start by limiting the
batch distance to 2, then gradually increase that during the scan.
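To make that a bit more concrete, roughly something like this (just a sketch, all the names are made up and the doubling is arbitrary; the real thing would live in the indexam / read stream glue, not a standalone function):

    /*
     * Sketch of the ramp-up I have in mind. Start at a small look-ahead
     * so short scans pay (almost) nothing, and widen it only once the
     * scan demonstrates it actually consumes the batches we read ahead.
     */
    #define BATCH_DISTANCE_INITIAL  2
    #define BATCH_DISTANCE_MAX      64

    static int
    next_batch_distance(int current_distance, int batches_consumed)
    {
        /* nothing consumed yet - stay at the minimum */
        if (batches_consumed == 0)
            return BATCH_DISTANCE_INITIAL;

        /* the scan keeps consuming, so allow it to look further ahead */
        if (current_distance < BATCH_DISTANCE_MAX)
            current_distance *= 2;

        if (current_distance > BATCH_DISTANCE_MAX)
            current_distance = BATCH_DISTANCE_MAX;

        return current_distance;
    }

The exact growth rate / cap obviously needs benchmarking, but that's the general shape.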
> Passing down a LIMIT N hint has proven to be a good idea -- and it
> doesn't really require applying any information related to the read
> stream. That's enough to prevent problems in the really extreme cases
> (e.g., nested loop antijoins with a LIMIT 1 on the inner side). The
> problematic merge join I showed you is a not-so-extreme case, which
> makes it trickier. ISTM that taking into consideration the number of
> "consumed" batches will not help that particular merge join query,
> precisely because it's not-so-extreme: the inner index scan consumes
> plenty of batches, but is nevertheless significantly regressed (at
> least when we don't yield at all).
>
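FWIW the way I imagine the LIMIT hint interacting with the look-ahead is something as simple as this (a sketch with made-up names, nothing close to the actual read stream / indexam code):

    /*
     * Sketch: cap the look-ahead distance using a LIMIT hint passed down
     * from the plan, so we never schedule more batches than the plan can
     * possibly consume. Names are made up for illustration.
     */
    static int
    clamp_distance_for_limit(int distance, int64 limit_hint,
                             int64 ntuples_returned)
    {
        int64       remaining;

        /* no LIMIT hint passed down - nothing to do */
        if (limit_hint <= 0)
            return distance;

        remaining = limit_hint - ntuples_returned;
        if (remaining < 1)
            remaining = 1;

        /* don't look further ahead than the plan can consume */
        if ((int64) distance > remaining)
            distance = (int) remaining;

        return distance;
    }

Which, as you say, handles the extreme cases (LIMIT 1 on the inner side), but does nothing for the merge join case where the scan does consume plenty of batches.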
>> To evaluate the amount of wasted work, it could be useful to make the read
>> stream stats page spit out the amount of "unconsumed" IOs at the end of the
>> scan.
>
> That would make sense. You can already tell when that's happened by
> comparing the details shown by EXPLAIN ANALYZE against the same query
> execution on master, but that approach is inconvenient. Automating my
> microbenchmarks has proven to be important with this project. There's
> quite a few competing considerations, and it's too easy to improve one
> query at the cost of regressing another.
>
What exactly counts as "unconsumed IO"? The IOs the stream already started but the scan then never consumed? Tracking that shouldn't be hard, I think.
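I'm thinking of something as simple as this (made-up names, just to illustrate what I mean by "unconsumed", not the actual read_stream.c structs):

    /*
     * Sketch of the proposed stat - count IOs that were started but never
     * consumed before the stream ended (i.e. wasted work).
     */
    typedef struct StreamIOStats
    {
        uint64      ios_started;    /* IOs the stream initiated */
        uint64      ios_consumed;   /* IOs the scan actually consumed */
        uint64      ios_unconsumed; /* started, but discarded at end of scan */
    } StreamIOStats;

    static void
    stream_report_unconsumed(StreamIOStats *stats)
    {
        /* whatever was started but never consumed is wasted work */
        stats->ios_unconsumed = stats->ios_started - stats->ios_consumed;
    }

That would make the wasted work visible directly in the stats page, instead of having to compare EXPLAIN ANALYZE output against master.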
regards
--
Tomas Vondra