Re: EXPLAIN: showing ReadStream / prefetch stats

Andres Freund Tue, 17 Mar 2026 13:49:35 -0700

Hi,

On 2026-03-17 20:32:44 +0100, Tomas Vondra wrote:
> On 3/17/26 19:40, Andres Freund wrote:
> > On 2026-03-17 18:51:15 +0100, Tomas Vondra wrote:
> >> Subject: [PATCH v2 2/2] explain: show prefetch stats in EXPLAIN (ANALYZE,
> >>  VERBOSE)
> >>
> >> This adds details about AIO / prefetch for a number of executor nodes
> >> using the ReadStream, notably:
> >>
> >> - SeqScan
> >> - BitmapHeapScan
> >>
> >> The statistics is tracked by the ReadStream, and then propagated up
> >> through the table AM interface.
> >>
> >> The ReadStream tracks the statistics unconditionally, i.e. even outside
> >> EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
> >> integer counters), it's not worth gating this by a flag.
> >>
> >> The TAM gets one new callback "scan_stats", to collect stats from the
> >> scan (which fetch tuples from the TAM). There is also a new struct
> >> TableScanStatsData/TableScanStats to separate the statistics from the
> >> actual TAM implementation.
> >
> > It's not really clear to me that this need to go through the tableam
> > interface.  All the accesses already happen within AM code, it seems we 
> > could
> > just populate fields in TableScanDesc TableScanDesc->rs_flags indicates that
> > that is desired.
> >
>
> Interesting idea to pass this data through the scan descriptor ...
>
> > That seems like it might be nicer, because
> >
> > a) there could be multiple read stream instances within a scan
> > b) we will need one such callback for each different type of scan - so we
> >    could just as well do that within the scan datastructure
> >
>
> AFAICS (a) is very similar to the argument I made regarding the
> stack-based instrumentation earlier, as it wasn't clear to me how would
> that work for scans with multiple streams. It didn't occur to me it's an
> issue for the current solution with TAM callbacks too.
>
> However, does passing the info through the scan descriptor address this?
> I don't see how could it, if there's just a single field per scan.


I'd expect the AM to merge the stats internally.


> Also, what do you mean by "one callback for each different type of
> scan"? Right now we need two callbacks - one for TableScanDesc and one
> for IndexScanDesc.

Those two were what I was referring to. The only reason we don't need more is
that we already are combining different scan types (seq, bitmap, tid, sample)
via TableScanDesc, so we really already are quite strongly associating the
stats with the *ScanDesc structures.

Regardless of how we structure it, since we already combine stats for streams
in the parallel query stats, perhaps we should add a field to track for how
many streams the stats are?


> Do we expect to get more?

I think we eventually might want to use read streams for things like getting
the source tuple to update/delete in a bulk update and for detoasting during
expression evaluation (which is surprisingly often IO bound).  These would be
bound to an AM but currently we don't have a real descriptor (however, the
need for that has come up repeatedly, e.g. to be able to use multi_insert for
INSERT statements).

I also suspect that there are some executor nodes that could benefit from
doing streaming IO internally, e.g. Sort, Materialize, spilling Hash Agg, Hash
joins...  But that'd not be abstracted behind an AM or such, so it's not
really orthogonal.


For things like bulk inserts it would be quite nice if we could use AIO for
reading in index pages with fewer synchronous waits by doing a loop over the
to-be-inserted entries and doing an index lookup that stops at the first miss
and start reading that in.


Before thinking about detoasting I had a bit of a hard time seeing a clear
advantage of using the stack based infra - thinking we always should have very
clear association between the executor node and the stream (in contrast to the
buffer stats which are collected without any possibility to make such an
association).  However, if we wanted to e.g. track streams that are done as
part of expression evaluation - I am not at all sure whether we do - that'd
dissolve that clear association, and would be much easier to implement with
the stack based approach.


> I don't see much difference between an optional TAM callback for each
> scan type, and an optional field in each scan descriptor.

It's not entirely different, I agree.

The scan descriptor approach seems a tad nicer because it means that a) the
read stream can be terminated once it's not needed before we get to the stats
collection phase and b) it allows to combine the stats for multiple streams,
without the AM having to store the stats in the AM specific part of the
descriptor, in a field that is then just returned by the AM.


> We can base this on rs_flags, but wouldn't it be required in all
> table_beginscan variants anyway? I have initially planned to pass a flag
> indicating we're in EXPLAIN (ANALYZE), but rs_flags is not that. It's
> hard-coded in tableam.

I think Melanie has a patch that changes that :)

Until then you could just add a helper that ORs in the instrumentation flag.



> >> +/*
> >> + * read_stream_update_stats_io
> >> + *                update read_stream stats about size of I/O requests
> >> + *
> >> + * We count the number of I/O requests, size of requests (counted in 
> >> blocks)
> >> + * and number of in-progress I/Os.
> >> + */
> >> +static inline void
> >> +read_stream_update_stats_io(ReadStream *stream, int nblocks, int 
> >> in_progress)
> >> +{
> >> +  stream->stats.io_count++;
> >> +  stream->stats.io_nblocks += nblocks;
> >> +  stream->stats.io_in_progress += in_progress;
> >> +}
> >
> > What's the point of the caller passing in stream->ios_in_progress?  Are you
> > expecting some callers to pass in different values? If so, why?
> >
>
> Are you suggesting we remove the argument and just use
> stream->ios_in_progress directly?

Yes, that's what I am suggesting.

Greetings,

Andres Freund

Re: EXPLAIN: showing ReadStream / prefetch stats

Reply via email to