Here are some review comments for the patch v8-0001: ======
1. Commit message 1a. Majority of the logic on the subscriber side has already existed in the code. SUGGESTION The majority of the logic on the subscriber side already exists in the code. ~ 1b. Second, when REPLICA IDENTITY IS FULL on the publisher and an index is used on the subscriber... SUGGESTION Second, when REPLICA IDENTITY FULL is on the publisher and an index is used on the subscriber... ~ 1c. Still, below I try to show case the potential improvements using an index on the subscriber `pgbench_accounts(bid)`. With the index, the replication catches up around ~5 seconds. When the index is dropped, the replication takes around ~300 seconds. "show case" -> "showcase" ~ 1d. In above text, what was meant by "catches up around ~5 seconds"? e.g. Did it mean *improves* by ~5 seconds, or *takes* ~5 seconds? ~ 1e. // create one indxe, even on a low cardinality column typo "indxe" ====== 2. GENERAL 2a. There are lots of single-line comments that start lowercase, but by convention, I think they should start uppercase. e.g. + /* we should always use at least one attribute for the index scan */ e.g. + /* we might not need this if the index is unique */ e.g. + /* avoid expensive equality check if index is unique */ e.g. + /* unrelated Path, skip */ e.g. + /* simple case, we already have an identity or pkey */ e.g. + /* indexscans are disabled, use seq. scan */ e.g. + /* target is a regular table */ ~~ 2b. There are some excess blank lines between the function. By convention, I think 1 blank line is normal, but here there are sometimes 2. ~~ 2c. There are some new function comments which include their function name in the comment. It seemed unnecessary. e.g. GetCheapestReplicaIdentityFullPath e.g. FindUsableIndexForReplicaIdentityFull e.g. LogicalRepUsableIndex ====== 3. src/backend/executor/execReplication.c - build_replindex_scan_key - int attoff; + int index_attoff; + int scankey_attoff; bool isnull; Datum indclassDatum; oidvector *opclass; int2vector *indkey = &idxrel->rd_index->indkey; - bool hasnulls = false; - - Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) || - RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel)); indclassDatum = SysCacheGetAttr(INDEXRELID, idxrel->rd_indextuple, Anum_pg_index_indclass, &isnull); Assert(!isnull); opclass = (oidvector *) DatumGetPointer(indclassDatum); + scankey_attoff = 0; Maybe just assign scankey_attoff = 0 at the declaration? ~~~ 4. + /* + * There are two cases to consider. First, if the index is a primary or + * unique key, we cannot have any indexes with expressions. So, at this + * point we are sure that the index we deal is not these. + */ "we deal" -> "we are dealing with" ? ~~~ 5. + /* + * For a non-primary/unique index with an additional expression, do + * not have to continue at this point. However, the below code + * assumes the index scan is only done for simple column references. + */ + continue; Is this one of those comments that ought to have a "XXX" prefix as a note for the future? ~~~ 6. - int pkattno = attoff + 1; ... /* Initialize the scankey. */ - ScanKeyInit(&skey[attoff], - pkattno, + ScanKeyInit(&skey[scankey_attoff], + index_attoff + 1, BTEqualStrategyNumber, Wondering if it would have been simpler if you just did: int pkattno = index_attoff + 1; ~~~ 7. - skey[attoff].sk_flags |= SK_ISNULL; + skey[scankey_attoff].sk_flags |= SK_ISNULL; + skey[scankey_attoff].sk_flags |= SK_SEARCHNULL; SUGGESTION skey[scankey_attoff].sk_flags |= (SK_ISNULL | SK_SEARCHNULL) ~~~ 8. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex @@ -128,28 +171,44 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid, TransactionId xwait; Relation idxrel; bool found; + TypeCacheEntry **eq; + bool indisunique; + int scankey_attoff; /* Open the index. */ idxrel = index_open(idxoid, RowExclusiveLock); + indisunique = idxrel->rd_index->indisunique; + + /* we might not need this if the index is unique */ + eq = NULL; Maybe just default assign eq = NULL in the declaration? ~~~ 9. + scan = index_beginscan(rel, idxrel, &snap, + scankey_attoff, 0); Unnecessary wrapping? ~~~ 10. + /* we only need to allocate once */ + if (eq == NULL) + eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts); But shouldn't you also free this 'eq' before the function returns, to prevent leaking memory? ====== 11. src/backend/replication/logical/relation.c - logicalrep_rel_open + /* + * Finding a usable index is an infrequent operation, it is performed + * only when first time an operation is performed on the relation or + * after invalidation of the relation cache entry (e.g., such as ANALYZE). + */ SUGGESTION (minor rewording) Finding a usable index is an infrequent task. It is performed only when an operation is first performed on the relation, or after invalidation of the relation cache entry (e.g., such as ANALYZE). ~~~ 12. src/backend/replication/logical/relation.c - logicalrep_partition_open Same as comment #11 above. ~~~ 13. src/backend/replication/logical/relation.c - GetIndexOidFromPath +static +Oid +GetIndexOidFromPath(Path *path) Typically I think 'static Oid' should be on one line. ~~~ 14. + switch (path->pathtype) + { + case T_IndexScan: + case T_IndexOnlyScan: + { + IndexPath *index_sc = (IndexPath *) path; + indexOid = index_sc->indexinfo->indexoid; + + break; + } + + default: + indexOid = InvalidOid; + } Is there any point in using a switch statement when there is only one functional code block? Why not just do: if (path->pathtype == T_IndexScan || path->pathtype == T_IndexOnlyScan) { ... } return InvalidOid; ~~~ 15. src/backend/replication/logical/relation.c - IndexOnlyOnExpression + * Returns true if the given index consist only of expressions such as: + * CREATE INDEX idx ON table(foo(col)); "consist" -> "consists" ~~~ 16. +IndexOnlyOnExpression(IndexInfo *indexInfo) +{ + int i=0; + for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++) Don't initialise 'i' twice. ~~~ 17. + AttrNumber attnum = indexInfo->ii_IndexAttrNumbers[i]; + if (AttributeNumberIsValid(attnum)) + return false; + + } Spurious blank line ~~~ 18. src/backend/replication/logical/relation.c - GetCheapestReplicaIdentityFullPath +/* + * Iterates over the input path list, and returns another path list + * where paths with non-btree indexes, partial indexes or + * indexes on only expressions are eliminated from the list. + */ "path list, and" -> "path list and" ~~~ 19. + if (!OidIsValid(indexOid)) + { + /* unrelated Path, skip */ + suitableIndexList = lappend(suitableIndexList, path); + continue; + } + + indexRelation = index_open(indexOid, AccessShareLock); + indexInfo = BuildIndexInfo(indexRelation); + is_btree_index = (indexInfo->ii_Am == BTREE_AM_OID); + is_partial_index = (indexInfo->ii_Predicate != NIL); + is_index_only_on_expression = IndexOnlyOnExpression(indexInfo); + index_close(indexRelation, NoLock); + + if (!is_btree_index || is_partial_index || is_index_only_on_expression) + continue; Maybe better to change this logic using if/else and changing the last condition so them you can avoid having any of those 'continue' in this loop. ~~~ 20. src/backend/replication/logical/relation.c - GetCheapestReplicaIdentityFullPath +/* + * GetCheapestReplicaIdentityFullPath generates all the possible paths + * for the given subscriber relation, assuming that the source relation + * is replicated via REPLICA IDENTITY FULL. + * + * The function assumes that all the columns will be provided during + * the execution phase, given that REPLICA IDENTITY FULL gurantees + * that. + */ 20a. typo "gurantees" ~ 20b. The function comment neglects to say that after getting all these paths the final function return is the cheapest one that it found. ~~~ 21. + for (attno = 0; attno < RelationGetNumberOfAttributes(localrel); attno++) + { + Form_pg_attribute attr = TupleDescAttr(localrel->rd_att, attno); + + if (attr->attisdropped) + { + continue; + } + else + { + Expr *eq_op; Maybe simplify by just removing the 'else' or instead just reverse the condition of the 'if'. ~~~ 22. + /* + * A sequential scan has could have been dominated by + * by an index scan during make_one_rel(). We should always + * have a sequential scan before set_cheapest(). + */ "has could have been" -> "could have been" ~~~ 23. src/backend/replication/logical/relation.c - LogicalRepUsableIndex +static Oid +LogicalRepUsableIndex(Relation localrel, LogicalRepRelation *remoterel) +{ + Oid idxoid; + + /* + * We never need index oid for partitioned tables, always rely on leaf + * partition's index. + */ + if (localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + return InvalidOid; + + /* simple case, we already have an identity or pkey */ + idxoid = GetRelationIdentityOrPK(localrel); + if (OidIsValid(idxoid)) + return idxoid; + + /* indexscans are disabled, use seq. scan */ + if (!enable_indexscan) + return InvalidOid; I thought the (!enable_indexscan) fast exit perhaps should be done first, or at least before calling GetRelationIdentityOrPK. ====== 24. src/backend/replication/logical/worker.c - apply_handle_delete_internal @@ -2034,12 +2021,14 @@ apply_handle_delete_internal(ApplyExecutionData *edata, EPQState epqstate; TupleTableSlot *localslot; bool found; + Oid usableIndexOid = usable_indexoid_internal(edata, relinfo); EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1); ExecOpenIndices(relinfo, false); - found = FindReplTupleInLocalRel(estate, localrel, remoterel, - remoteslot, &localslot); + + found = FindReplTupleInLocalRel(estate, localrel, usableIndexOid, + remoterel, remoteslot, &localslot); 24a. Excess blank line above FindReplTupleInLocalRel call. ~ 24b. This code is almost same in function handle_update_internal(), except the wrapping of the params is different. Better to keep everything consistent looking. ~~~ 25. src/backend/replication/logical/worker.c - usable_indexoid_internal +/* + * Decide whether we can pick an index for the relinfo (e.g., the relation) + * we're actually deleting/updating from. If it is a child partition of + * edata->targetRelInfo, find the index on the partition. + */ +static Oid +usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo *relinfo) I'm not sure is this can maybe return InvalidOid? The function comment should clarify it. ~~~ 26. I might be mistaken, but somehow I feel this function can be simplified. e.g. If you have a var 'relmapentry' and let the normal table use the initial value of that. Then I think you only need to test for the partitioned table and reassign that var as appropriate. It also removes the need for having 'usableIndexOid' var. FOR EXAMPLE, static Oid usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo *relinfo) { ResultRelInfo *targetResultRelInfo = edata->targetRelInfo; LogicalRepRelMapEntry *relmapentry = edata->targetRel; Oid targetrelid = targetResultRelInfo->ri_RelationDesc->rd_rel->oid; Oid localrelid = relinfo->ri_RelationDesc->rd_id; if (targetrelid != localrelid) { /* * Target is a partitioned table, so find relmapentry of the partition. */ TupleConversionMap *map = relinfo->ri_RootToPartitionMap; AttrMap *attrmap = map ? map->attrMap : NULL; LogicalRepPartMapEntry *part_entry = logicalrep_partition_open(relmapentry, relinfo->ri_RelationDesc, attrmap); Assert(targetResultRelInfo->ri_RelationDesc->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); relmapentry = part_entry->relmapentry; } return relmapentry->usableIndexOid; } ~~~ 27. + /* + * Target is a partitioned table, get the index oid the partition. + */ SUGGESTION Target is a partitioned table, so get the index oid of the partition. or (see the example of comment @26) ~~~ 28. src/backend/replication/logical/worker.c - FindReplTupleInLocalRel @@ -2093,12 +2125,11 @@ FindReplTupleInLocalRel(EState *estate, Relation localrel, *localslot = table_slot_create(localrel, &estate->es_tupleTable); I think this might have been existing functionality... The comment says "* Local tuple, if found, is returned in '*localslot'." But the code is unconditionally doing table_slot_create() before it even knows if a tuple was found or not. So what about when it is NOT found - in that case shouldn't there be some cleaning up that (unused?) table slot that got unconditionally created? ~~~ 29. src/backend/replication/logical/worker.c - apply_handle_tuple_routing @@ -2202,13 +2233,17 @@ apply_handle_tuple_routing(ApplyExecutionData *edata, * suitable partition. */ { + LogicalRepRelMapEntry *entry; TupleTableSlot *localslot; ResultRelInfo *partrelinfo_new; bool found; + entry = &part_entry->relmapentry; Maybe just do this assignment at the entry declaration time? ~~~ 30. /* Get the matching local tuple from the partition. */ found = FindReplTupleInLocalRel(estate, partrel, - &part_entry->remoterel, + part_entry->relmapentry.usableIndexOid, + &entry->remoterel, remoteslot_part, &localslot); Why not use the new 'entry' var just assigned instead of repeating part_entry->relmapentry? SUGGESTION found = FindReplTupleInLocalRel(estate, partrel, entry->usableIndexOid, &entry->remoterel, remoteslot_part, &localslot); ~~~ 31. + slot_modify_data(remoteslot_part, localslot, entry, newtup); Unnecessary wrapping. ====== 32. src/include/replication/logicalrelation.h +typedef struct LogicalRepPartMapEntry +{ + Oid partoid; /* LogicalRepPartMap's key */ + LogicalRepRelMapEntry relmapentry; +} LogicalRepPartMapEntry; IIUC this struct has been moved from relation.c to here. But I think there was a large comment about this struct which maybe needs to be moved with it (see the original relation.c). /* * Partition map (LogicalRepPartMap) * * When a partitioned table is used as replication target, replicated * operations are actually performed on its leaf partitions, which requires * the partitions to also be mapped to the remote relation. Parent's entry * (LogicalRepRelMapEntry) cannot be used as-is for all partitions, because * individual partitions may have different attribute numbers, which means * attribute mappings to remote relation's attributes must be maintained * separately for each partition. */ ====== 33. .../subscription/t/032_subscribe_use_index.pl Typo "MULTIPILE" This typo occurs several times... e.g. # Testcase start: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS e.g. # Testcase end: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS e.g. # Testcase start: SUBSCRIPTION USES INDEX WITH MULTIPILE COLUMNS e.g. # Testcase end: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS ~~~ 34. # Basic test where the subscriber uses index # and only touches multiple rows What does "only ... multiple" mean? This occurs several times also. ~~~ 35. +# wait for initial table synchronization to finish +$node_subscriber->wait_for_subscription_sync; +$node_subscriber->wait_for_subscription_sync; +$node_subscriber->wait_for_subscription_sync; That triple wait looks unusual. Is it deliberate? ------ Kind Regards, Peter Smith. Fujitsu Australia