On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote: > We may save the size of data in VirtualTupleTableSlot, thus avoiding > the first loop. I assume that when allocating > VirtualTupleTableSlot->data, we always know what size we are > allocating so it should be just a matter of saving it in > VirtualTupleTableSlot->size. This should avoid the first loop in > tts_virtual_materialize() and give some speed up. We will need a loop > to repoint non-byval Datums. I imagine that the pointers to non-byval > Datums can be computed as dest_slot->tts_values[natts] = > dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data). > This would work as long as all the non-byval datums in the source slot > are all stored flattened in source slot's data. I am assuming that > that would be true in a materialized virtual slot. The byval datums > are copied as is. I think, this will avoid multiple memcpy calls, one > per non-byval attribute and hence some speedup. I may be wrong though.
hmm, do you think it's common enough that we copy an already materialised virtual slot? I tried adding the following code totts_virtual_copyslot and didn't see the NOTICE message when running each of your test queries. "make check" also worked without anything failing after adjusting nodeUnique.c to always use a TTSOpsVirtual slot. + if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot)) + elog(NOTICE, "We copied a materialized virtual slot!"); I did get a failure in postgres_fdw's tests: server loopback options (table_name 'tab_batch_sharded_p1_remote'); insert into tab_batch_sharded select * from tab_batch_local; +NOTICE: We copied a materialized virtual slot! +NOTICE: We copied a materialized virtual slot! so I think it's probably not that common that we'd gain from that optimisation. What might buy us a bit more would be to get rid of the for loop inside tts_virtual_copyslot() and copy the guts of tts_virtual_materialize() into tts_virtual_copyslot() and set the dstslot tts_isnull and tts_values arrays in the same loop that we use to calculate the size. I tried that in the attached patch and then tested it alongside the patch that changes the slot type. master = 74604a37f 1 = [1] 2 = optimize_tts_virtual_copyslot.patch Using the script from [2] and the setup from [3] but reduced to 10k tuples instead of 1 million. Times the average query time in milliseconds for a 60 second pgbench run. query master master+1 master+1+2 m/m+1 m/m+1+2 Q1 2.616 2.082 1.903 125.65% 137.47% Q2 9.479 10.593 10.361 89.48% 91.49% Q3 10.329 10.357 10.627 99.73% 97.20% Q4 7.272 7.306 6.941 99.53% 104.77% Q5 7.597 7.043 6.645 107.87% 114.33% Q6 162.177 161.037 162.807 100.71% 99.61% Q7 59.288 59.43 61.294 99.76% 96.73% 103.25% 105.94% I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but did, so the results might not be stable enough to warrant making any decisions from. I was uncertain if the old behaviour of when srcslot contains fewer attributes than dstslot was intended or not. What happens there is that we'd leave the additional old dstslot tts_values in place and only overwrite up to srcslot->natts but then we'd go on and materialize all the dstslot attributes. I think this might not be needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may be ok just to materialize the srcslot attributes and ignore the previous additional dstslot attributes. Changing it to that would make the attached patch more simple. David [1] https://www.postgresql.org/message-id/attachment/151110/use_subnode_slot_type_for_nodeunique.patch [2] https://www.postgresql.org/message-id/attachment/151342/uniquebench.sh.txt [3] https://www.postgresql.org/message-id/CAExHW5uhTMdkk26oJg9f2ZVufbi5J4Lquj79MdSO%2BipnGJ_muw%40mail.gmail.com
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c index 2c2712ceac..4f8c6c24c0 100644 --- a/src/backend/executor/execTuples.c +++ b/src/backend/executor/execTuples.c @@ -251,7 +251,10 @@ tts_virtual_materialize(TupleTableSlot *slot) static void tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot) { - TupleDesc srcdesc = srcslot->tts_tupleDescriptor; + TupleDesc srcdesc = srcslot->tts_tupleDescriptor; + TupleDesc dstdesc = dstslot->tts_tupleDescriptor; + Size sz = 0; + char *data; Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts); @@ -259,17 +262,105 @@ tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot) slot_getallattrs(srcslot); - for (int natt = 0; natt < srcdesc->natts; natt++) + /* make sure storage doesn't depend on external memory */ + for (int natt = 0; natt < dstdesc->natts; natt++) { - dstslot->tts_values[natt] = srcslot->tts_values[natt]; - dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt]; + Form_pg_attribute att = TupleDescAttr(dstdesc, natt); + Datum val; + bool isnull; + + /* copy the values/isnull from the srcslot */ + if (natt < srcdesc->natts) + { + dstslot->tts_values[natt] = val = srcslot->tts_values[natt]; + dstslot->tts_isnull[natt] = isnull = srcslot->tts_isnull[natt]; + } + else + { + /* + * when dstslot has more ntts than srcslot we must materialize + * the values from dstslot + */ + val = dstslot->tts_values[natt]; + isnull = dstslot->tts_isnull[natt]; + } + + if (att->attbyval || isnull) + continue; + + if (att->attlen == -1 && + VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val))) + { + /* + * We want to flatten the expanded value so that the materialized + * slot doesn't depend on it. + */ + sz = att_align_nominal(sz, att->attalign); + sz += EOH_get_flat_size(DatumGetEOHP(val)); + } + else + { + sz = att_align_nominal(sz, att->attalign); + sz = att_addlength_datum(sz, att->attlen, val); + } + } + + /* do we need to materialize any columns? */ + if (sz > 0) + { + VirtualTupleTableSlot *vdstslot = (VirtualTupleTableSlot *) dstslot; + + /* allocate memory */ + vdstslot->data = data = MemoryContextAlloc(dstslot->tts_mcxt, sz); + dstslot->tts_flags |= TTS_FLAG_SHOULDFREE; + + /* and copy all attributes into the pre-allocated space */ + for (int natt = 0; natt < dstdesc->natts; natt++) + { + Form_pg_attribute att = TupleDescAttr(dstdesc, natt); + Datum val; + + if (att->attbyval || dstslot->tts_isnull[natt]) + continue; + + val = dstslot->tts_values[natt]; + + if (att->attlen == -1 && + VARATT_IS_EXTERNAL_EXPANDED(DatumGetPointer(val))) + { + Size data_length; + + /* + * We want to flatten the expanded value so that the + * materialized slot doesn't depend on it. + */ + ExpandedObjectHeader *eoh = DatumGetEOHP(val); + + data = (char *) att_align_nominal(data, att->attalign); + data_length = EOH_get_flat_size(eoh); + EOH_flatten_into(eoh, data, data_length); + + dstslot->tts_values[natt] = PointerGetDatum(data); + data += data_length; + } + else + { + Size data_length = 0; + + data = (char *) att_align_nominal(data, att->attalign); + data_length = att_addlength_datum(data_length, att->attlen, + val); + + memcpy(data, DatumGetPointer(val), data_length); + + dstslot->tts_values[natt] = PointerGetDatum(data); + data += data_length; + } + } } dstslot->tts_nvalid = srcdesc->natts; dstslot->tts_flags &= ~TTS_FLAG_EMPTY; - - /* make sure storage doesn't depend on external memory */ - tts_virtual_materialize(dstslot); } static HeapTuple