Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:
> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <and...@anarazel.de> wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > thought. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
>
> I will provide my inputs; Heikki, please correct me or add your inputs.
>
> I am not sure how much gain this practically provides, if the rest of
> the system continues to use the returned value in terms of blocks. I
> understand that being block based (and not just block based, but with
> all blocks of the relation storing data and full tuples) is ingrained in
> the system. So yes, breaking out of that is a much larger change, not
> just limited to the table AM API.
I don't think it's that ingrained in all that many parts of the system.
Outside of the places I listed upthread, and the one index case that
stashes extra info, which places are that "block based"?

> I feel most of the issues discussed here would be faced by zheap as
> well, as not all blocks/pages contain data; e.g. TPD pages should be
> excluded from sampling and TID scans, etc...

It's not a problem so far, and zheap works on tableam. You can just skip
such blocks during sampling / analyze, and return nothing for tidscans.

> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > > This should imo be moved to the tableam callback. It's currently done
> > > a bit weirdly imo, with fdws computing relpages in the callback, but
> > > then also returning the acquirefunc. Seems like it should entirely be
> > > computed as part of calling acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
> > would be to return the desired sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
>
> Yes, just routing the relation size via the AM layer, using its
> returned value still in terms of blocks, and performing sampling based
> on it doesn't feel like it resolves the issue. Maybe sampling needs to
> be delegated completely to the AM layer. Code duplication could be
> avoided by similar AMs (heap and zheap) using some common utility
> functions to achieve the intended result.

I don't know what this is actually proposing.

> > I suspect - as previously mentioned - that we're going to have to extend
> > statistics collection beyond the current approach at some point, but I
> > don't think that's now.
> > At least to me it's not clear how to best
> > represent the stats, and how to best use them, if the underlying
> > storage is fundamentally not block based. Nor how we'd avoid code
> > duplication...
>
> Yes, will have to give this more thought.

> > > 3) nodeTidscan, skipping over too large tids
> > >
> > > I think this should just be moved into the AMs, there's no need to
> > > have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >     /*
> >      * We silently discard any TIDs that are out of range at the time
> >      * of scan start.  (Since we hold at least AccessShareLock on the
> >      * table, it won't be possible for someone to truncate away the
> >      * blocks we intend to visit.)
> >      */
> >     nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where the tid isn't tied to the physical
> > representation.
>
> Agree, it's not nice to have that optimization performed based on the
> number of blocks in the generic layer. I feel it's not efficient for
> zheap either, due to the TPD pages mentioned above, as the number of
> blocks returned will be higher than the number of actual data blocks.

I don't think there's a problem for zheap. The blocks are just
interspersed.

Having pondered this a lot more, I think this is the way to go for 12.
Then we can improve on this for v13, to make it nicer.

> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter. But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
> Thinking out loud here, we can possibly tackle this in multiple ways.
> First, the above mentioned check seems more like an optimization to me
> than functionally needed; correct me if I'm wrong. If that's true, we
> can check with the AM whether it wishes to apply that optimization,
> based on relation size.

It'd be really expensive to check this differently for heap. We'd have
to check the relation size, which is out of the question imo.

Greetings,

Andres Freund