Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:
> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <and...@anarazel.de> wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > thought. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary. I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
>
> I will provide my inputs; Heikki, please correct me or add your inputs.
>
> I am not sure how much gain this practically provides, if the rest of
> the system continues to use the returned value in terms of blocks. I
> understand that being block based (and not just block based, but with
> all blocks of the relation storing data and full tuples) is ingrained in
> the system. So yes, breaking out of that is a much larger change, not
> just limited to the table AM API.
I don't think it's that ingrained in all that many parts of the system.
Outside of the places I listed upthread, and the one index case that
stashes extra info, which places are that "block based"?

> I feel most of the issues discussed here would be faced by zheap as
> well, as not all blocks/pages contain data; e.g. TPD pages should be
> excluded from sampling and TID scans, etc...

It's not a problem so far, and zheap works on tableam. You can just skip
such blocks during sampling / analyze, and return nothing for tidscans.

> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > > This should imo be moved to the tableam callback. It's currently done
> > > a bit weirdly imo, with fdws computing relpages in the callback, but
> > > then also returning the acquirefunc. Seems like it should entirely be
> > > computed as part of calling acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value. The alternative
> > would be to return the desired sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
>
> Yes, just routing the relation size via the AM layer, using its
> returned value still in terms of blocks, and performing sampling based
> on it doesn't feel like it resolves the issue. Maybe sampling needs to
> be delegated completely to the AM layer. Code duplication could be
> avoided by similar AMs (heap and zheap) using some common utility
> functions to achieve the intended result.

I don't know what this is actually proposing.

> > I suspect - as previously mentioned - that we're going to have to extend
> > statistics collection beyond the current approach at some point, but I
> > don't think that's now.
> > At least to me it's not clear how to best
> > represent the stats, and how to best use them, if the underlying
> > storage is fundamentally not block based. Nor how we'd avoid code
> > duplication...
>
> Yes, will have to give this more thought.

> > > 3) nodeTidscan, skipping over too large tids
> > >
> > > I think this should just be moved into the AMs, there's no need to
> > > have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >     /*
> >      * We silently discard any TIDs that are out of range at the time
> >      * of scan start.  (Since we hold at least AccessShareLock on the
> >      * table, it won't be possible for someone to truncate away the
> >      * blocks we intend to visit.)
> >      */
> >     nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where the tid isn't tied to the physical
> > representation.
>
> Agree, it's not nice to have that optimization performed based on the
> number of blocks in the generic layer. I feel it's not efficient for
> zheap either, due to the TPD pages mentioned above, as the number of
> blocks returned will be higher than the number of actual data blocks.

I don't think there's a problem for zheap. The blocks are just
interspersed.

Having pondered this a lot more, I think this is the way to go for 12.
Then we can improve on this for v13, to make it nicer.

> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter. But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
> Thinking out loud here, we can possibly tackle this in multiple ways.
> First, the above mentioned check seems more like an optimization to me
> than functionally needed; correct me if I'm wrong. If that's true, we
> can check with the AM whether it wishes to apply that optimization,
> based on relation size.

It'd be really expensive to check this differently for heap. We'd have
to check the relation size, which is out of the question imo.

Greetings,

Andres Freund