Re: [HACKERS] sortsupport for text

Peter Geoghegan Wed, 20 Jun 2012 05:37:22 -0700

On 19 June 2012 19:44, Peter Geoghegan <[email protected]> wrote:
> PostgreSQL supported Unicode before 2005, when the tie-breaker was
> introduced. I know at least one Swede who used Postgres95. I just took
> a look at the REL6_4 branch, and it looks much the same in 1999 as it
> did in 2005, in that there is no tie-breaker after the strcoll(). Now,
> that being the case, and Hungarian in particular having a whole bunch
> of these equivalencies, I have to wonder if the original complainant's
> problem really was diagnosed correctly. It could have had something to
> do with the fact that texteq() was confused about whether it reported
> equality or equivalency - it may have taken that long for the (len1 !=
> len2) fastpath thing (only holds for equality, not equivalence,
> despite the fact that the 2005-era strcoll() call checks equivalence
> within texteq() ) to trip someone out, because texteq() would have
> thereby given inconsistent answers in a very subtle way, that were not
> correct either according to the Hungarian locale, nor according to
> simple bitwise equality.


It seems likely that this is more-or-less correct. The two equivalent
strings had a variable number of characters, so the fastpath made
texteq not accord with varstr_cmp(), even though texteq() itself also
only had a single strcoll() call.

So there were some tuples with the Hungarian string "potty" in the 2005
bug report. They were not visible for any of the queries seen in test
cases, even the "good" ones. There were a few tuples with equivalent
strings like "potyty" that were visible.

The index scans didn't fail to return the expected tuples because the
indexes were internally inconsistent or otherwise corrupt. Rather, the
_bt_checkkeys() function returned early because the faulty "half
equivalence, half equality" texteq() comparator reported that the
tuple returned didn't satisfy the qual, and on that basis the index
scan stopped. This usually wouldn't happen with Swedish, because their
equivalencies tend to be one character long, and are less common.
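
To make that failure mode concrete, here is a toy model. Everything in
it is hypothetical and none of it is the actual PostgreSQL code:
toy_coll() stands in for a strcoll() that can report strings of
different lengths as equivalent (it simply ignores '-' characters),
faulty_texteq() mimics the length fastpath followed by a collation
comparison, and toy_scan() is a much-simplified stand-in for an index
scan that stops at the first entry that fails the qual.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical collation: compares two strings while ignoring '-'
 * characters, so strings of different lengths can be equivalent.
 * Returns <0, 0 or >0, like strcoll().
 */
int
toy_coll(const char *a, const char *b)
{
    for (;;)
    {
        while (*a == '-')
            a++;
        while (*b == '-')
            b++;
        if (*a != *b)
            return (unsigned char) *a < (unsigned char) *b ? -1 : 1;
        if (*a == '\0')
            return 0;
        a++;
        b++;
    }
}

/*
 * "Half equivalence, half equality": a length fastpath that is only
 * valid for bitwise equality, followed by a collation comparison.
 */
int
faulty_texteq(const char *a, const char *b)
{
    if (strlen(a) != strlen(b))
        return 0;
    return toy_coll(a, b) == 0;
}

/*
 * Simplified scan over entries that are sorted according to toy_coll()
 * and positioned at the first entry equivalent to key.  The scan stops
 * as soon as one entry fails the equality qual.  Returns the number of
 * entries reported as matching.
 */
size_t
toy_scan(const char *const *entries, size_t n, const char *key)
{
    size_t      matches = 0;

    assert(entries != NULL && key != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (!faulty_texteq(entries[i], key))
            break;
        matches++;
    }
    return matches;
}
```

With the entries {"pot-ty", "potty"} (equivalent under toy_coll(), so
either order is a valid sort order) and the key "potty", toy_scan()
reports no matches at all: the first entry fails the length fastpath,
and the scan stops before it ever reaches the bitwise-equal "potty".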

So far, so good, but how did this not blow up sooner? Did Hungarians
only start using Postgres in late 2005, immediately after the 8.1
release? Hardly.

Commit c1d62bfd00f4d1ea0647e12947ca1de9fea39b33, made in late 2003,
"Add operator strategy and comparison-value datatype fields to
ScanKey", may be part of the problem here.

Consider this test within _bt_checkkeys(), that was changed by that commit:

-       if (key->sk_flags & SK_COMMUTE)
-           test = FunctionCall2(&key->sk_func,
-                                key->sk_argument, datum);
-       else
-           test = FunctionCall2(&key->sk_func,
-                                datum, key->sk_argument);
+       test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

-       if (DatumGetBool(test) == !!(key->sk_flags & SK_NEGATE))
+       if (!DatumGetBool(test))

I think that this change may have made the difference between the
Hungarians getting away with it and not getting away with it. Might it
have been that for text, they were using some operator that wasn't '='
(perhaps one which has no fastpath, and thus correctly made a
representation about equivalency) rather than texteq prior to this
commit? I didn't eyeball the pg_amop entries of the era myself, but it
seems quite possible. In any case, I find it hard to believe that it
took at least ten years for this problem to manifest itself just
because it took that long for a Hungarian with a strcoll()
implementation that correctly represented equivalency to use Postgres.
A year is a more plausible window.

If we do introduce an idea of equivalency to make all this work, that
means there'll have to be equivalency verification when equality
verification returned false in a number of places, including the
above. For GiST, there is an equivalent test, dubbed "the consistent
function", whose behaviour still varies based on the strategy number
used, and which seems analogous to the above.
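
A minimal sketch of what such a "not-equal-but-maybe-equivalent" test
could look like follows. The TextMatch enum and text_match() are names
invented for illustration, not anything in the tree, and the sketch
assumes NUL-terminated inputs so that strcoll() can be called directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical three-way result; not an existing PostgreSQL type. */
typedef enum
{
    TEXT_DIFFERENT,
    TEXT_EQUIVALENT,
    TEXT_EQUAL
} TextMatch;

/*
 * Two-stage test.  a and b must be NUL-terminated; alen and blen are
 * their lengths in bytes, excluding the terminator.
 */
TextMatch
text_match(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);

    /* Stage 1: bitwise equality.  The length fastpath is valid here. */
    if (alen == blen && memcmp(a, b, alen) == 0)
        return TEXT_EQUAL;

    /* Stage 2: not equal, but maybe equivalent per the current locale. */
    if (strcoll(a, b) == 0)
        return TEXT_EQUIVALENT;

    return TEXT_DIFFERENT;
}
```

In the "C" locale strcoll() behaves like strcmp(), so the second stage
can never report equivalence there; it only matters for locales that
have such equivalencies.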

So, you're going to have an extra strcoll()/strxfrm() + strcmp() here,
as part of a "not-equal-but-maybe-equivalent" test, which is bad.
However, if that means that we can cache a text constant as a
strxfrm() blob, and compare in a strxfrm()-wise fashion, that will
more than pay for itself, even for btree traversal alone. For a naive
strxfrm() + strcoll() implementation, that will save just under half
of the work, and everyone knows that the cost of the comparison is
what dominates here, particularly for certain collations.
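
The caching idea can be sketched roughly as below. make_sort_key() and
cmp_cached_key() are hypothetical helper names, and this is only an
illustration of the technique using the standard strxfrm() and
strcmp(), not a proposal for the actual patch. The constant is
transformed once up front; each later comparison then transforms only
the other string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a strxfrm() blob for s.  Returns a malloc'd, NUL-terminated
 * key that the caller must free, or NULL if allocation fails.
 */
char *
make_sort_key(const char *s)
{
    size_t      need;
    char       *key;

    assert(s != NULL);

    /* With a zero-length buffer, strxfrm() only reports the size needed. */
    need = strxfrm(NULL, s, 0) + 1;
    key = malloc(need);
    if (key != NULL)
        (void) strxfrm(key, s, need);
    return key;
}

/*
 * Compare a cached key (built once for the constant) against a string.
 * Only s is transformed here; the result has the same sign as
 * strcoll(constant, s).
 */
int
cmp_cached_key(const char *cached_key, const char *s)
{
    char       *skey = make_sort_key(s);
    int         result;

    if (skey == NULL)
        abort();                /* sketch only: no real error handling */
    result = strcmp(cached_key, skey);
    free(skey);
    return result;
}
```

A caller would build the key for the constant once, for example with
make_sort_key("potty"), and reuse it for every comparison made during
the traversal.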

We'd probably formalise it to the point where there'd be a btree
strategy number and fully-fledged equivalency operator that the user
could conceivably use themselves.

There seems to be scope-creep here. I'm not sure that I should
continue with this as part of this review. Maybe this should be
something that I work on for the next commitfest.

It would be nice to hear what others thought of these ideas before I
actually start writing a patch that fixes these problems (our
behaviour is incorrect for some locales according to the Unicode
standard), facilitates a strxfrm() optimisation, and actually adds a
strxfrm() optimisation.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

