On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar <dilipbal...@gmail.com> wrote: > [ new patch ]
+typedef pg_int64 RelFileNumber; This seems really random to me. First, why isn't this an unsigned type? OID is unsigned and I don't see a reason to change to a signed type. But even if we were going to change to a signed type, why pg_int64? That is declared like this: /* Define a signed 64-bit integer type for use in client API declarations. */ typedef PG_INT64_TYPE pg_int64; Surely this is not a client API declaration.... Note that if we change this a lot of references to INT64_FORMAT will need to become UINT64_FORMAT. I think we should use int64 at the SQL level, because we don't have an unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits. So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or similar. But internally I think using unsigned is cleaner. + * RelFileNumber is unique within a cluster. Not really, because of CREATE DATABASE. Probably just drop this line. Or else expand it: we never assign the same RelFileNumber twice within the lifetime of the same cluster, but there can be multiple relations with the same RelFileNumber e.g. because CREATE DATABASE duplicates the RelFileNumber values from the template database. But maybe we don't need this here, as it's already explained in relfilelocator.h. + ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS); Why not declare ret as ForkNumber instead of casting twice? + uint64 relnum; + + Assert(relnumber <= MAX_RELFILENUMBER); + Assert(forknum <= MAX_FORKNUM); + + relnum = relnumber; Perhaps it'd be better to write uint64 relnum = relnumber instead of initializing on a separate line. +#define RELNUMBERCHARS 20 /* max chars printed by %llu */ Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if there's some reason to stick with a signed type). + elog(ERROR, "relfilenumber is out of bound"); It would have to be "out of bounds", with an "s". But maybe "is too large" would be better. + nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; + loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber; + flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber; Maybe it would be a good idea to asset that next <= flushed and flushed <= logged? +#ifdef USE_ASSERT_CHECKING + + { + RelFileLocatorBackend rlocator; + char *rpath; Let's add a comment here, like "Because the RelFileNumber counter only ever increases and never wraps around, it should be impossible for the newly-allocated RelFileNumber to already be in use. But, if Asserts are enabled, double check that there's no main-fork relation file with the new RelFileNumber already on disk." + elog(ERROR, "cannot forward RelFileNumber during recovery"); forward -> set (or advance) + if (relnumber >= ShmemVariableCache->loggedRelFileNumber) It probably doesn't make any difference, but to me it seems better to test flushedRelFileNumber rather than logRelFileNumber here. What do you think? /* * We set up the lockRelId in case anything tries to lock the dummy - * relation. Note that this is fairly bogus since relNumber may be - * different from the relation's OID. It shouldn't really matter though. - * In recovery, we are running by ourselves and can't have any lock - * conflicts. While syncing, we already hold AccessExclusiveLock. + * relation. Note we are setting relId to just FirstNormalObjectId which + * is completely bogus. It shouldn't really matter though. In recovery, + * we are running by ourselves and can't have any lock conflicts. While + * syncing, we already hold AccessExclusiveLock. */ rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid; - rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber; + rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId; Boy, this makes me uncomfortable. The existing logic is pretty bogus, and we're replacing it with some other bogus thing. Do we know whether anything actually does try to use this for locking? One notable difference between the existing logic and your change is that, with the existing logic, we use a bogus value that will differ from one relation to the next, whereas with this change, it will always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId = (Oid) rlocator.relNumber would be a more natural adaptation? +#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \ +do { \ + if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ + ereport(ERROR, \ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ + errmsg("relfilenumber %lld is out of range", \ + (long long) (relfilenumber))); \ +} while (0) Here, you take the approach of casting the relfilenumber to long long and then using %lld. But elsewhere, you use INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we ought to use it everywhere. typedef struct { - Oid reltablespace; - RelFileNumber relfilenumber; -} RelfilenumberMapKey; - -typedef struct -{ - RelfilenumberMapKey key; /* lookup key - must be first */ + RelFileNumber relfilenumber; /* lookup key - must be first */ Oid relid; /* pg_class.oid */ } RelfilenumberMapEntry; This feels like a bold change. Are you sure it's safe? i.e. Are you certain that there's no way that a relfilenumber could repeat within a database? If we're going to bank on that, we could adapt this more heavily, e.g. RelidByRelfilenumber() could lose the reltablespace parameter. I think maybe we should push this change into an 0002 patch (or later) and have 0001 just do a minimal adaptation for the changed data type. Datum pg_control_checkpoint(PG_FUNCTION_ARGS) { - Datum values[18]; - bool nulls[18]; + Datum values[19]; + bool nulls[19]; Documentation updated is needed. -Note that while a table's filenode often matches its OID, this is -<emphasis>not</emphasis> necessarily the case; some operations, like +Note that table's filenode are completely different than its OID. Although for +system catalogs initial filenode matches with its OID, but some operations, like <command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms of <command>ALTER TABLE</command>, can change the filenode while preserving the OID. -Avoid assuming that filenode and table OID are the same. Suggest: Note that a table's filenode will normally be different than the OID. For system tables, the initial filenode will be equal to the table OID, but it will be different if the table has ever been subjected to a rewriting operation, such as TRUNCATE, REINDEX, CLUSTER, or some forms of ALTER TABLE. For user tables, even the initial filenode will be different than the table OID. -- Robert Haas EDB: http://www.enterprisedb.com