Many have expressed their interest in this topic, but I haven't seen any
design of how it should work. Here's my attempt; I've been playing with
this for some time now and I think what I propose here is a good initial
plan. This will allow us to write permanent table storage that works
differently than heapam.c. At this stage, I haven't thought through
whether this is going to allow extensions to define new storage modules;
I am focusing on AMs that can coexist with heapam in core.
The design starts with a new row type in pg_am, of type "s" (for "storage").
The handler function returns a node struct, StorageAmRoutine. This
contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples
(tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the
like), and operations on tuples that are part of slots (tuple_deform,
To support this, we introduce StorageTuple and StorageScanDesc.
StorageTuples represent a physical tuple coming from some storage AM.
It is necessary to have a pointer to a StorageAmRoutine in order to
manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple.
RelationData gains ->rd_stamroutine which is a pointer to the
StorageAmRoutine for the relation in question. Similarly,
TupleTableSlot is augmented with a link to the StorageAmRoutine to
handle the StorageTuple it contains (probably in most cases it's set at
the same time as the tupdesc). This implies that routines such as
ExecAssignScanType need to pass down the StorageAmRoutine from the
relation to the slot.
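
To make the plumbing concrete, here is a minimal standalone sketch of how
the routine pointer could travel from the relation to the slot. It is not
taken from the patch: the structs are cut down to one callback each, and
ToyHeapTuple, tts_stamroutine, tts_tuple, slot_assign_from_relation,
slot_store_tuple and slot_get_xmin are hypothetical names chosen for
illustration. Only StorageAmRoutine, StorageTuple, rd_stamroutine and
get_xmin come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque physical tuple; only the owning storage AM knows its layout. */
typedef void *StorageTuple;

/* Cut-down routine table: a single callback out of the ones proposed. */
typedef struct StorageAmRoutine
{
    unsigned int (*get_xmin) (StorageTuple tuple);
} StorageAmRoutine;

/* Cut-down relation descriptor carrying the AM's routine table. */
typedef struct RelationData
{
    const StorageAmRoutine *rd_stamroutine;
} RelationData;

/* Cut-down slot; both field names are assumptions for this sketch. */
typedef struct TupleTableSlot
{
    const StorageAmRoutine *tts_stamroutine;
    StorageTuple tts_tuple;
} TupleTableSlot;

/* Toy tuple layout standing in for a heap-like AM (hypothetical). */
typedef struct ToyHeapTuple
{
    unsigned int xmin;
    int          payload;
} ToyHeapTuple;

unsigned int
toyheap_get_xmin(StorageTuple tuple)
{
    return ((ToyHeapTuple *) tuple)->xmin;
}

const StorageAmRoutine toyheap_routine = { toyheap_get_xmin };

/* In the spirit of ExecAssignScanType: hand the routine to the slot. */
void
slot_assign_from_relation(TupleTableSlot *slot, const RelationData *rel)
{
    slot->tts_stamroutine = rel->rd_stamroutine;
    slot->tts_tuple = NULL;
}

/* Store a tuple; the slot must already know which AM produced it. */
void
slot_store_tuple(TupleTableSlot *slot, StorageTuple tuple)
{
    assert(slot->tts_stamroutine != NULL);
    slot->tts_tuple = tuple;
}

/* Callers never look inside the tuple; they go through the routine. */
unsigned int
slot_get_xmin(const TupleTableSlot *slot)
{
    assert(slot->tts_tuple != NULL);
    return slot->tts_stamroutine->get_xmin(slot->tts_tuple);
}
```

The point of the sketch is that code holding only a slot can still
interpret the tuple, because the slot remembers which AM produced it.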
The executor is modified so that instead of calling heap_insert etc
directly, it uses rel->rd_stamroutine to call these methods. The
executor is still in charge of dealing with indexes, constraints, and
any other thing that's not the tuple storage itself (this is one major
point in which this differs from FDWs). This all looks simple enough,
with one exception and a few notes:
exception a) ExecMaterializeSlot needs special consideration. This is
used in two different ways: a1) is the stated "make tuple independent
from any underlying storage" point, which is handled by
ExecMaterializeSlot itself and calling a method from the storage AM to
do any byte copying as needed. ExecMaterializeSlot no longer returns a
HeapTuple, because there might not be any. The second usage pattern a2)
is to create a HeapTuple that's passed to other modules which only deal
with HT and not slots (triggers are the main case I noticed, but I think
there are others such as the executor itself wanting tuples as Datum for
some reason). For the moment I'm handling this by having a new
ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of
the original tuple format.
note b) EvalPlanQual currently maintains an array of HeapTuple in
EState->es_epqTuple. I think it works to replace that with an array of
StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine
methods in order to interact with it. Other than those changes, it
note c) nodeSubplan has curTuple as a HeapTuple. It seems simple
to replace this with an independent slot-based tuple.
note d) grp_firstTuple in nodeAgg / nodeSetOp. These are less
simple than the above, but replacing the HeapTuple with a slot-based
tuple seems doable too.
note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual.
TupleTableSlot also seems a good replacement. This has fallout in other
users of EvalPlanQual, too.
note f) More widespread, MinimalTuples currently use a tweaked HeapTuple
format. In the long run, it may be possible to replace them with a
separate storage module that's specifically designed to handle tuples
meant for tuplestores etc. That may simplify TupleTableSlot and
execTuples. For the moment we keep the tts_mintuple as it is. Whenever
a tuple is not already in heap format, we heapify it in order to put in
The current heapam.c routines need some changes. Current practice is
that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on
their input tuples to set the resulting ItemPointer in tuple->t_self.
This is messy if we want StorageTuples to be abstract. I'm changing
this so that the resulting ItemPointer is returned in a separate output
argument; the tuple itself is left alone. This is somewhat messy in the
case of heap_multi_insert because it returns several items; I think it's
acceptable to return an array of ItemPointers in the same order as the
input tuples. This works fine for the only caller, which is COPY in
batch mode. For the other routines, they don't really care where the
TID is returned AFAICS.
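
As a rough illustration of that calling convention (not the actual
heapam.c signatures), the sketch below uses a toy in-memory "heap".
ToyItemPointer, ToyTuple, ToyHeap, toy_insert and toy_multi_insert are
all hypothetical names; the only thing it tries to show is that the
input tuples are const and the TIDs come back through separate output
arguments, in input order for the multi-insert case.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TUPLES_PER_PAGE 4
#define TOY_MAX_TUPLES      64

/* Simplified stand-in for an ItemPointer: block number plus offset. */
typedef struct ToyItemPointer
{
    unsigned int   block;
    unsigned short offset;      /* 1-based, like offset numbers */
} ToyItemPointer;

/* The tuple carries no t_self-like field for the AM to scribble on. */
typedef struct ToyTuple
{
    int payload;
} ToyTuple;

typedef struct ToyHeap
{
    ToyTuple tuples[TOY_MAX_TUPLES];
    size_t   ntuples;
} ToyHeap;

/*
 * Insert one tuple.  The input is const; the resulting TID is written
 * to *tid_out.  Returns 0 on success, -1 if the toy heap is full.
 */
int
toy_insert(ToyHeap *heap, const ToyTuple *tuple, ToyItemPointer *tid_out)
{
    size_t pos;

    assert(tid_out != NULL);
    if (heap->ntuples >= TOY_MAX_TUPLES)
        return -1;

    pos = heap->ntuples++;
    heap->tuples[pos] = *tuple;
    tid_out->block = (unsigned int) (pos / TOY_TUPLES_PER_PAGE);
    tid_out->offset = (unsigned short) (pos % TOY_TUPLES_PER_PAGE + 1);
    return 0;
}

/*
 * Insert a batch.  tids_out[i] receives the TID of tuples[i], so the
 * caller can match results to inputs by position.  Nothing is inserted
 * if the whole batch does not fit.
 */
int
toy_multi_insert(ToyHeap *heap, const ToyTuple *tuples, size_t ntuples,
                 ToyItemPointer *tids_out)
{
    size_t i;

    if (ntuples > TOY_MAX_TUPLES - heap->ntuples)
        return -1;

    for (i = 0; i < ntuples; i++)
    {
        if (toy_insert(heap, &tuples[i], &tids_out[i]) != 0)
            return -1;
    }
    return 0;
}
```

A batch caller such as COPY would allocate the TID array alongside its
tuple buffer and read the results back by index.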
Additional noteworthy items:
i) Speculative insertion: the speculative insertion token is no longer
installed directly in the heap tuple by the executor (of course).
Instead, the token becomes part of the slot. When the tuple_insert
method is called, the insertion routine is in charge of setting the
token from the slot into the storage tuple. Executor is in charge of
calling method->speculative_finish() / abort() once the insertion has
been confirmed by the indexes.
ii) execTuples has additional accessors for tuples-in-slot, such as
ExecFetchSlotTuple and friends. I expect some of them to return
abstract StorageTuples, others HeapTuple or MinimalTuples (possibly
wrapped in Datum), depending on callers. We might be able to cut down
on these later; my first cut will try to avoid API changes to keep
fallout to a minimum.
iii) All tuples need to be identifiable by ItemPointers. Storages that
have different requirements will need careful additional thought across
iv) System catalogs cannot use pluggable storage. We continue to use
heap_open etc in the DDL code, in order not to make this more invasive
than it already is. We may lift this restriction later for specific
catalogs, as needed.
v) Currently, one Buffer may be associated with one HeapTuple living in a
slot; when the slot is cleared, the buffer pin is released. My current
patch moves the buffer pin to inside the heapam-based storage AM and the
buffer is released by the ->slot_clear_tuple method. The rationale for
doing this is that some storage AMs might want to keep several buffers
pinned at once, for example, and must not release those pins
individually but in batches as the scan moves forwards (say a batch of
tuples in a columnar storage AM has column values spread across many
buffers; they must all be kept pinned until the scan has moved past the
whole set of tuples). But I'm not really sure that this is a great
I welcome comments on these ideas. My patch for this is nowhere near
completion yet; expect things to change for items that I've overlooked,
but I hope I didn't overlook anything major. If things are handwavy, it is
probably because I haven't fully figured them out yet.
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services