On 3/17/26 21:49, Andres Freund wrote:
> Hi,
> 
> On 2026-03-17 20:32:44 +0100, Tomas Vondra wrote:
>> On 3/17/26 19:40, Andres Freund wrote:
>>> On 2026-03-17 18:51:15 +0100, Tomas Vondra wrote:
>>>> Subject: [PATCH v2 2/2] explain: show prefetch stats in EXPLAIN (ANALYZE,
>>>>  VERBOSE)
>>>>
>>>> This adds details about AIO / prefetch for a number of executor nodes
>>>> using the ReadStream, notably:
>>>>
>>>> - SeqScan
>>>> - BitmapHeapScan
>>>>
>>>> The statistics are tracked by the ReadStream, and then propagated up
>>>> through the table AM interface.
>>>>
>>>> The ReadStream tracks the statistics unconditionally, i.e. even outside
>>>> EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
>>>> integer counters), so it's not worth gating this by a flag.
>>>>
>>>> The TAM gets one new callback "scan_stats" to collect stats from the
>>>> scan (which fetches tuples from the TAM). There is also a new struct
>>>> TableScanStatsData/TableScanStats to separate the statistics from the
>>>> actual TAM implementation.
>>>
>>> It's not really clear to me that this needs to go through the tableam
>>> interface.  All the accesses already happen within AM code; it seems we
>>> could just populate fields in TableScanDesc if TableScanDesc->rs_flags
>>> indicates that that is desired.
>>>
>>
>> Interesting idea to pass this data through the scan descriptor ...
>>
>>> That seems like it might be nicer, because
>>>
>>> a) there could be multiple read stream instances within a scan
>>> b) we will need one such callback for each different type of scan - so we
>>>    could just as well do that within the scan data structure
>>>
>>
>> AFAICS (a) is very similar to the argument I made earlier regarding the
>> stack-based instrumentation, as it wasn't clear to me how that would
>> work for scans with multiple streams. It didn't occur to me that it's an
>> issue for the current solution with TAM callbacks too.
>>
>> However, does passing the info through the scan descriptor address this?
>> I don't see how it could, if there's just a single field per scan.
> 
> I'd expect the AM to merge the stats internally.
> 

Hmmm, I'm not sure that'll be good enough. What if the two streams are
fundamentally different? For example, in the future we may end up using
two streams in index scans. One stream for the table (which is what the
prefetching is doing), and one for the index itself. Does it really make
sense to merge stats for the streams?

Furthermore, which place would be responsible for collecting and merging
the stats? AFAICS it needs to happen at the end of the scan, but it
can't happen in endscan, because that frees the descriptor (so we can't
store the result there). Wouldn't we need a new callback anyway?

> 
>> Also, what do you mean by "one callback for each different type of
>> scan"? Right now we need two callbacks - one for TableScanDesc and one
>> for IndexScanDesc.
> 
> Those two were what I was referring to. The only reason we don't need more is
> that we already are combining different scan types (seq, bitmap, tid, sample)
> via TableScanDesc, so we really already are quite strongly associating the
> stats with the *ScanDesc structures.
> 
> Regardless of how we structure it, since we already combine stats for streams
> in the parallel query stats, perhaps we should add a field tracking how many
> streams the stats cover?
> 

We could. But does it make sense to combine stats from distinct streams?
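For what it's worth, the merging itself would be mechanical either way. A self-contained sketch of what per-AM merging could look like, including the stream-count field suggested above (all type and function names here are made up for illustration, not existing code):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified per-stream counters, standing in for ReadStream stats. */
typedef struct StreamIOStats
{
	uint64_t	io_count;
	uint64_t	io_nblocks;
} StreamIOStats;

/*
 * Hypothetical scan-level struct the AM would keep in its descriptor,
 * with a counter so consumers know how many streams were folded in.
 */
typedef struct MergedScanStats
{
	uint64_t	io_count;
	uint64_t	io_nblocks;
	int			nstreams;	/* how many streams were merged in */
} MergedScanStats;

static void
merge_stream_stats(MergedScanStats *dst, const StreamIOStats *src)
{
	dst->io_count += src->io_count;
	dst->io_nblocks += src->io_nblocks;
	dst->nstreams++;
}
```

The nstreams field at least keeps averages (blocks per I/O, etc.) interpretable after merging, though it doesn't answer whether merging semantically different streams is meaningful in the first place.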

> 
>> Do we expect to get more?
> 
> I think we eventually might want to use read streams for things like getting
> the source tuple to update/delete in a bulk update and for detoasting during
> expression evaluation (which is surprisingly often IO bound).  These would be
> bound to an AM but currently we don't have a real descriptor (however, the
> need for that has come up repeatedly, e.g. to be able to use multi_insert for
> INSERT statements).
> 

True. But then again - does it make sense to "merge" stats from all
these streams used for very different purposes? OTOH I'm not sure it
works with the TAM approach either, as we don't want a separate callback
for each individual stream.

> I also suspect that there are some executor nodes that could benefit from
> doing streaming IO internally, e.g. Sort, Materialize, spilling Hash Agg, Hash
> joins...  But that'd not be abstracted behind an AM or such, so it's not
> really orthogonal.
> 

Agreed.

> 
> For things like bulk inserts it would be quite nice if we could use AIO for
> reading in index pages with fewer synchronous waits by doing a loop over the
> to-be-inserted entries and doing an index lookup that stops at the first miss
> and start reading that in.
> 

Yes. I had a very ugly PoC last year to prefetch leaf pages for bulk
inserts, and it was a huge benefit.

> 
> Before thinking about detoasting I had a bit of a hard time seeing a clear
> advantage of using the stack based infra - thinking we always should have very
> clear association between the executor node and the stream (in contrast to the
> buffer stats which are collected without any possibility to make such an
> association).  However, if we wanted to e.g. track streams that are done as
> part of expression evaluation - I am not at all sure whether we do - that'd
> dissolve that clear association, and would be much easier to implement with
> the stack based approach.
> 

Right. In some cases the association to streams may not be quite clear.

Would it be possible for scans to "define" a list of streams? For
example, an index scan might have one stream for the index, one stream
for the table, and one stream for TOAST. And we'd have an array of
stats, with one slot for each stream.

With the current approach, we'd collect stats from each of those
streams into the correct "slot" in the instrumentation.

With the stack-based instrumentation we might "tell" each stream which
slot to use. So ReadStream would get a field "stats_index" and it'd be
passed to INSTR_IOUSAGE_INCR() to update the correct slot.

I haven't tried any of this, and AFAIK we don't even have a scan using
two read streams yet.

> 
>> I don't see much difference between an optional TAM callback for each
>> scan type, and an optional field in each scan descriptor.
> 
> It's not entirely different, I agree.
> 
> The scan descriptor approach seems a tad nicer because it means that a) the
> read stream can be terminated once it's not needed, before we get to the
> stats collection phase, and b) it allows combining the stats for multiple
> streams without the AM having to store the stats in the AM-specific part of
> the descriptor, in a field that is then just returned by the AM.
> 
> 
>> We can base this on rs_flags, but wouldn't it be required in all
>> table_beginscan variants anyway? I had initially planned to pass a flag
>> indicating we're in EXPLAIN (ANALYZE), but rs_flags is not that. It's
>> hard-coded in tableam.
> 
> I think Melanie has a patch that changes that :)
> 
> Until then you could just add a helper that ORs in the instrumentation flag.
> 

OK.
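For illustration, such a helper might look like the following self-contained sketch (the flag names and bit values are made up here, not the real ScanOptions bits):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative scan option bits (values are not the real ones). */
#define SO_TYPE_SEQSCAN		(1 << 0)
#define SO_ALLOW_STRAT		(1 << 4)
/* Hypothetical new bit requesting read-stream instrumentation. */
#define SO_NEED_INSTRUMENT	(1 << 10)

/*
 * Hypothetical helper: take the flags a table_beginscan variant would
 * normally hard-code and OR in the instrumentation bit when the caller
 * is running under EXPLAIN (ANALYZE).
 */
static inline uint32_t
scan_flags_with_instrumentation(uint32_t flags, bool want_instrument)
{
	if (want_instrument)
		flags |= SO_NEED_INSTRUMENT;
	return flags;
}
```

That keeps the hard-coded defaults in tableam untouched until the rs_flags rework lands.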

> 
> 
>>>> +/*
>>>> + * read_stream_update_stats_io
>>>> + *                update read_stream stats about size of I/O requests
>>>> + *
>>>> + * We count the number of I/O requests, size of requests (counted in blocks)
>>>> + * and number of in-progress I/Os.
>>>> + */
>>>> +static inline void
>>>> +read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
>>>> +{
>>>> +  stream->stats.io_count++;
>>>> +  stream->stats.io_nblocks += nblocks;
>>>> +  stream->stats.io_in_progress += in_progress;
>>>> +}
>>>
>>> What's the point of the caller passing in stream->ios_in_progress?  Are you
>>> expecting some callers to pass in different values? If so, why?
>>>
>>
>> Are you suggesting we remove the argument and just use
>> stream->ios_in_progress directly?
> 
> Yes, that's what I am suggesting.
> 

OK, will do.
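For the archives, the simplified version would be along these lines (the struct definitions below are minimal stand-ins for the real ReadStream, just enough to make the sketch self-contained):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the real ReadStream structures. */
typedef struct ReadStreamStats
{
	uint64_t	io_count;		/* number of I/O requests started */
	uint64_t	io_nblocks;		/* total request size, in blocks */
	uint64_t	io_in_progress;	/* running sum of in-flight I/Os */
} ReadStreamStats;

typedef struct ReadStream
{
	int			ios_in_progress;	/* current number of in-flight I/Os */
	ReadStreamStats stats;
} ReadStream;

/*
 * As suggested: drop the in_progress argument and read
 * stream->ios_in_progress directly.
 */
static inline void
read_stream_update_stats_io(ReadStream *stream, int nblocks)
{
	stream->stats.io_count++;
	stream->stats.io_nblocks += nblocks;
	stream->stats.io_in_progress += stream->ios_in_progress;
}
```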


regards

-- 
Tomas Vondra


