On 19 June 2012 19:44, Peter Geoghegan <pe...@2ndquadrant.com> wrote:
> PostgreSQL supported Unicode before 2005, when the tie-breaker was
> introduced. I know at least one Swede who used Postgres95. I just took
> a look at the REL6_4 branch, and it looks much the same in 1999 as it
> did in 2005, in that there is no tie-breaker after the strcoll(). Now,
> that being the case, and Hungarian in particular having a whole bunch
> of these equivalencies, I have to wonder if the original complainant's
> problem really was diagnosed correctly. It could of had something to
> do with the fact that texteq() was confused about whether it reported
> equality or equivalency - it may have taken that long for the (len1 !=
> len2) fastpath thing (only holds for equality, not equivalence,
> despite the fact that the 2005-era strcoll() call checks equivalence
> within texteq() ) to trip someone out, because texteq() would have
> thereby given inconsistent answers in a very subtle way, that were not
> correct either according to the Hungarian locale, nor according to
> simple bitwise equality.

It seems likely that this is more-or-less correct. The two equivalent strings had different numbers of characters, so the fastpath made texteq() not accord with varstr_cmp(), even though texteq() itself also only had a single strcoll() call.

So there were some tuples with the Hungarian string "potty" in the 2005 bug report. They were not visible for any of the queries seen in test cases, even the "good" ones. There were a few tuples with equivalent strings like "potyty" that were visible. The index scans didn't fail to return the expected tuples because the indexes were internally inconsistent or otherwise corrupt. Rather, the _bt_checkkeys() function returned early because the faulty "half equivalence, half equality" texteq() comparator reported that the tuple returned didn't satisfy the qual, and on that basis the index scan stopped. This usually wouldn't happen with Swedish, because its equivalencies tend to be one character long, and are less common.

So far, so good, but how did this not blow up sooner? Did Hungarians only start using Postgres in late 2005, immediately after the 8.1 release? Hardly.

Commit c1d62bfd00f4d1ea0647e12947ca1de9fea39b33, made in late 2003, "Add operator strategy and comparison-value datatype fields to ScanKey", may be part of the problem here. Consider this test within _bt_checkkeys(), which was changed by that commit:

-		if (key->sk_flags & SK_COMMUTE)
-			test = FunctionCall2(&key->sk_func,
-								 key->sk_argument, datum);
-		else
-			test = FunctionCall2(&key->sk_func,
-								 datum, key->sk_argument);
+		test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

-		if (DatumGetBool(test) == !!(key->sk_flags & SK_NEGATE))
+		if (!DatumGetBool(test))

I think that this change may have made the difference between the Hungarians getting away with it and not getting away with it.
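
To make the failure mode described above concrete, here is a minimal sketch in C. It is not PostgreSQL code: toy_strcoll() is a hypothetical stand-in for strcoll() under a Hungarian-like locale (it simply expands "tty" to "tyty"), and toy_varstr_cmp(), toy_texteq_mixed() and toy_scan_count() are illustrative names I made up. The only point is to show how a length fastpath in front of an equivalence-based comparison can disagree with the ordering comparator, and how a scan that stops at the first non-matching key then misses tuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy collation rule (an assumption for illustration only): the contraction
 * "tty" is expanded to "tyty", so "potty" and "potyty" become equivalent.
 * Input that does not fit in dst is silently truncated; fine for a sketch.
 */
static void toy_expand(const char *src, char *dst, size_t dstlen)
{
    size_t o = 0;

    assert(dstlen > 0);
    while (*src != '\0' && o + 4 < dstlen)
    {
        if (strncmp(src, "tty", 3) == 0)
        {
            memcpy(dst + o, "tyty", 4);
            o += 4;
            src += 3;
        }
        else
            dst[o++] = *src++;
    }
    dst[o] = '\0';
}

/* Hypothetical stand-in for strcoll(): compares the expanded forms. */
static int toy_strcoll(const char *a, const char *b)
{
    char ea[64], eb[64];

    toy_expand(a, ea, sizeof ea);
    toy_expand(b, eb, sizeof eb);
    return strcmp(ea, eb);
}

/* Ordering comparator: one collation call, no tie-breaker. */
static int toy_varstr_cmp(const char *a, const char *b)
{
    return toy_strcoll(a, b);
}

/* "Half equivalence, half equality": length fastpath, then collation. */
static bool toy_texteq_mixed(const char *a, const char *b)
{
    if (strlen(a) != strlen(b))
        return false;           /* only valid for bitwise equality */
    return toy_strcoll(a, b) == 0;
}

/*
 * Simulated equality scan over keys sorted by toy_varstr_cmp(): position at
 * the first key >= arg, then stop at the first key that fails the qual.
 */
static int toy_scan_count(const char *const *keys, int nkeys, const char *arg)
{
    int i = 0, matched = 0;

    while (i < nkeys && toy_varstr_cmp(keys[i], arg) < 0)
        i++;
    for (; i < nkeys; i++)
    {
        if (!toy_texteq_mixed(keys[i], arg))
            break;              /* scan ends early */
        matched++;
    }
    return matched;
}
```

With the keys "alma", "potyty", "potty", "zab" (a valid sort order under the toy comparator, since the two middle keys compare as equivalent), a scan for "potty" positions itself on "potyty", fails the length fastpath, and stops without ever reaching the "potty" key.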

Might it have been that for text, they were using some operator that wasn't '=' (perhaps one which has no fastpath, and thus correctly made a representation about equivalency) rather than texteq prior to this commit? I didn't eyeball the pg_amop entries of the era myself, but it seems quite possible. In any case, I find it hard to believe that it took at least ten years for this problem to manifest itself just because it took that long for a Hungarian with a strcoll() implementation that correctly represented equivalency to use Postgres. A year is a more plausible window.

If we do introduce an idea of equivalency to make all this work, that means there'll have to be equivalency verification when equality verification returns false in a number of places, including the above. For GiST, there is an equivalent test, dubbed "the consistent function", that will still vary based on the strategy number used, which seems analogous to the above. So, you're going to have an extra strcoll()/strxfrm() + strcmp() here, as part of a "not-equal-but-maybe-equivalent" test, which is bad. However, if that means that we can cache a text constant as a strxfrm() blob, and compare in a strxfrm()-wise fashion, that will more than pay for itself, even for btree traversal alone. For a naive strxfrm() + strcoll() implementation, that will save just under half of the work, and everyone knows that the cost of the comparison is what dominates here, particularly for certain collations. We'd probably formalise it to the point where there'd be a btree strategy number and fully-fledged equivalency operator that the user could conceivably use themselves.

There seems to be scope-creep here. I'm not sure that I should continue with this as part of this review. Maybe this should be something that I work on for the next commitfest.
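
As a rough illustration of the caching idea above, the following sketch uses only standard C (strxfrm() and strcmp()). The CachedKey struct and the xfrm_dup(), cached_key_init(), cached_key_cmp() and cached_key_free() helpers are hypothetical names of mine, not anything in the PostgreSQL tree. It assumes the scan constant can be transformed once up front, so that each later comparison pays for one strxfrm() on the datum plus a strcmp(), rather than transforming both sides every time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a scan constant that is transformed only once. */
typedef struct CachedKey
{
    char *blob;                 /* strxfrm() output for the constant */
} CachedKey;

/* Return a malloc'd strxfrm() image of s, or NULL if out of memory. */
static char *xfrm_dup(const char *s)
{
    /* With n == 0, strxfrm() only reports the length it needs. */
    size_t need = strxfrm(NULL, s, 0) + 1;
    char  *buf = malloc(need);

    if (buf != NULL)
        strxfrm(buf, s, need);
    return buf;
}

/* Transform the constant once; returns 0 on success, -1 on failure. */
static int cached_key_init(CachedKey *key, const char *constant)
{
    key->blob = xfrm_dup(constant);
    return key->blob != NULL ? 0 : -1;
}

/*
 * Compare a datum against the cached constant. The sign of the result
 * matches strcoll(datum, constant) in the current locale.
 */
static int cached_key_cmp(const CachedKey *key, const char *datum)
{
    char *d = xfrm_dup(datum);
    int   r;

    assert(d != NULL);          /* sketch only: no real error handling */
    r = strcmp(d, key->blob);
    free(d);
    return r;
}

static void cached_key_free(CachedKey *key)
{
    free(key->blob);
    key->blob = NULL;
}
```

A caller would run cached_key_init() once per scan, call cached_key_cmp() per tuple, and treat a zero result as "equivalent" in the collation's sense. Whether the saving really approaches half the work depends on the strxfrm() implementation and the collation in use.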

It would be nice to hear what others think of these ideas before I actually start writing a patch that fixes these problems (our behaviour is incorrect for some locales according to the Unicode standard), facilitates a strxfrm() optimisation, and actually adds a strxfrm() optimisation.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services