[HACKERS] New approach to ye olde cross-datatype indexing problem

Tom Lane Sat, 08 Nov 2003 09:01:49 -0800

Once again it's time to consider that old problem that indexes can't
handle cross-datatype comparisons, for example
        SELECT ... WHERE int8col = 42
won't use an index on int8col because 42 is int4.


We have spent a huge amount of time trying to find ways to get the system
to assign the same type to the comparison value as the indexed column has.
So far every such proposal has failed, because of unwanted side-effects on
the semantics of operations unrelated to indexes.  I've become more and
more convinced that that approach is a dead end.

What if we attack the problem head on --- make indexes able to handle
cross-datatype comparison operators directly?  That would solve the
problem without *any* semantic side-effects.  AFAIR we've never seriously
considered doing that; I guess anyone who thought of it dismissed it as
too hard.  But I've been studying the idea for the last day or so and I'm
starting to think that it's eminently practical.

Here's what I'm thinking of: specify that the "input" datatype of an
operator class (pg_opclass.opcintype) is actually just the type of the
indexed column.  Operators that are members of the opclass must take this
type as their left-hand input, but the right-hand input can be some other
type.  pg_amop gets an additional column that is the right-hand data type
of the operator, and its primary key becomes (opclass, righthandtype,
strategy) rather than just (opclass, strategy).  For example, the btree
opclass for int8 will still use strategy number 1 for 'less than', but
it would now contain entries for int84lt, int82lt, etc in addition to
int8lt. We can still maintain the secondary unique index on (opclass,
operator) --- that is, any particular operator still stands in one unique
relationship to an opclass.  The support procedure catalog pg_amproc
likewise needs to get an additional column that is the "right hand side"
datatype.  For example, the btree opclass for int8 formerly had one
support procedure btint8cmp(int8, int8) but it will now need additional
entries for btint84cmp(int8, int4), btint82cmp(int8, int2) and so on.

The cross-datatype operators we need for this largely already exist ---
in fact it's their presence in the system catalogs that creates the
problem in the first place.  We'll need to write new functions for the
additional support procedures, but there aren't all that many of them.

The existing CREATE OPERATOR CLASS syntax already works for this approach,
since it's possible to specify the input datatypes of an operator.  We
just need to modify the code to allow multiple instances of the same
operator or function number as long as the input datatypes are different.

Inside the system, the changes needed are really rather minimal.  We will
need to add fields to the ScanKey data structure that specify the strategy
number of the operator being invoked and the righthand datatype (ie, the
actual datatype of the sk_argument datum).  Given that information and
knowledge of the index column's opclass, the actual operator can be looked
up in the pg_amop syscache.  The reason I want to add the strategy number
to ScanKey is that this is something the planner knows already (it finds
it out when it discovers that an operator can be used with the index in
the first place) and it seems silly to require the index access method to
re-discover the strategy number from the operator procedure OID.  Passing
the strategy number directly will save cycles.  Furthermore, it
essentially eliminates the need for the "StrategyMap" mechanism (istrat.c
and friends), which is a big chunk of vestigial code that I've been
wanting to get rid of for awhile.  The only thing the StrategyMap routines
are really doing for us is providing a procedure-OID-to-strategy-number
reverse lookup mechanism, and we won't need that anymore.  This is a good
thing since the existing StrategyMap data structure falls apart in the
presence of multiple operators with the same strategy number.

Relation cache entries for indexes will still include rd_operator and
rd_support fields indexed by strategy and procedure number.  We will
decree that the cached entries are only for the operators and procedures
whose righthand datatype is the same as the indexed datatype.  This is
reasonable since these will still be the most-used items (for example,
the index access method will need to use them to do internal comparisons
between index items).

The index access methods will need to be careful about whether they are
comparing two index entries (necessarily of the same datatype) or an index
entry against an externally-supplied scankey datum (possibly a different
datatype requiring a different operator).  In the btree code the changes
needed are pretty minimal.  I haven't looked at the other index types yet,
but actually I don't care much whether they can support cross-datatype
operators or not.  If it turns out to be at all hard for them to do it,
we can just say that they don't support it.  Fixing btree will solve 99.9%
of the problem anyway.

One way in which we will lose some flexibility is that this design nails
down forevermore the assumption that the indexed column is on the lefthand
side of any indexable clause.  This has been a de facto restriction for a
long time, but it will become a permanent restriction embedded in the
system catalog design.  I don't see that this is a problem though --- the
existing solution of letting the planner try to commute the clause to put
the indexkey on the left seems to work fine.

I'm planning to push forward with trying to implement this.  Any comments,
ideas, objections?

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

[HACKERS] New approach to ye olde cross-datatype indexing problem

Reply via email to