Inspired by this thread [1], and in particular by the idea of storing
three numbers (permanent ID, on-disk storage position, display
position) for each column, I spent a little time messing around with a
prototype implementation of column storage positions to see what kind
of difference it would make.  The results were encouraging: on a table
with 20 columns of alternating smallint and varchar(10) datatypes,
selecting the max() of one of the rightmost int columns across 1
million rows ran around 3 times faster.  The same query on the
leftmost varchar column (which should suffer the most from this
change) predictably got a little slower (about 10%); I couldn't
measure a performance drop on the rightmost varchar columns.  The
table's size didn't drop much in this case, but a different table of
20 alternating int and smallint columns showed a 20% slimmer disk
footprint, pretty much as expected.  Pgbenching showed no measurable
difference, which isn't surprising since the pgbench test tables
consist of just int values with char filler at the end.

So here is a proposal for separating a column's storage position from
its permanent ID.  I've ignored the display position piece of the
original thread because display positions don't do much other than
save you the hassle of creating a view on top of your table, while
storage positions have demonstrable, tangible benefits.  And there is
no reason to connect the two features; display positions can easily be
added separately at a later point.

We want to decouple a column's on-disk storage position from its
permanent ID for two reasons: to minimize the space lost to alignment
padding between fields, and to speed up access to individual fields.
The system will automatically assign new storage positions when a
table is created, and when a table alteration requires a rewrite
(currently just adding a column with a default, or changing a column
datatype).  To allow users to optimize tables based on the fields they
know will be frequently accessed, I think we should extend ALTER TABLE
to accept user-assigned storage positions (something like "ALTER TABLE
ALTER col SET STORAGE POSITION X").  This command would also be useful
for another reason discussed below.

In my prototype, I used these rules to determine columns' storage order:
1) fixed-width fields before variable-width, dropped columns always last
2) fixed-width fields ordered by increasing size
3) not-null fields before nullable fields
There are other approaches worth considering - for example, you could
imagine swapping the priority of rules 2 and 3.  Resultant tables
would generally have more alignment waste, but would tend to have
slightly faster field access.  I'm really not sure what the optimal
strategy is since every user will have a slightly different metric for
"optimal".  In any event, either of these approaches is better than
the current situation.

To implement this, we'll need a field (perhaps attstoragepos?) in
pg_attribute to hold the storage position.  It will equal attnum until
it is explicitly reassigned.  The routines in heaptuple.c need to
quickly loop through the fields of a tuple in storage order rather
than attnum order, so I propose extending TupleDesc to hold an
"attrspos" array that sits alongside the attrs array.  In the
prototype I used an array of int2 indices into the attrs array,
ordered by storage position.

These changes cause a problem in ExecTypeFromTLInternal: this function
calls CreateTemplateTupleDesc followed by TupleDescInitEntry, assuming
that attnum == attstoragepos for all tuples.  With the introduction of
storage positions, this of course will no longer be true.  I got
around this by having expand_targetlist, build_physical_tlist, and
build_relation_tlist make sure each TargetEntry (for targetlists
corresponding to either insert/update tuples, or base tuples pulled
straight from the heap) gets a correct resorigtbl and resname.  Then
ExecTypeFromTLInternal first tries calling a new function
TupleDescInitEntryAttr, which hands off to TupleDescInitEntry and then
performs a syscache lookup to update the storage position using the
resorigtbl.  This is a little ugly because ExecTypeFromTLInternal
doesn't know in advance what kind of tupledesc it's building, so it
needs to retreat to the old method whenever the syscache lookup fails,
but it was enough to pass the regression tests.  I could use some
advice on this - there's probably a better way to do it.

Another problem relates to upgrades.  With tools like pg_migrator now
on pgfoundry, people will eventually expect quick upgrades that don't
require rewriting each table's data.  Storage positions would cause a
problem for every version X -> version Y upgrade with Y >= 8.3, even
when X is also >= 8.3, because a version X table could always have
been altered without a rewrite into a structure different from what
Y's CREATE TABLE will choose.  I don't think it's as simple as just
using the above-mentioned ALTER TABLE extension to assign the proper
storage positions for each field, because the version X table could
have dropped columns that might or might not be present in any given
tuple on disk.  The best solution I can see is having pg_dump create a
table covering *all* columns (including dropped ones) with explicit
storage positions, and then immediately issue an alter statement to
get rid of the dropped columns.  I'm not thrilled about this approach
(in particular, preserving dropped columns across upgrades seems
sloppy), but I haven't been able to think of anything better.
Hopefully I'm missing a simpler way to do this.

Comments and ideas?  Does this whole thing seem worthwhile to do?

phil


[1] http://archives.postgresql.org/pgsql-hackers/2006-12/msg00780.php

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

               http://www.postgresql.org/about/donate

Reply via email to