On Sun, Apr 7, 2013 at 7:43 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > While reviewing the latest incarnation of the regex indexing patch, > I noticed that make_trigrams() in contrib/pg_trgm/trgm_op.c is coded > so that if USE_WIDE_UPPER_LOWER is not set, it ignores multibyte > character boundaries and just makes trigrams from 3-byte substrings. > This seems slightly insane, not least because there's an Assert there > that will fail if it's fed any multibyte characters. I suppose no one > has actually tried this code with non-ASCII data on machines where > USE_WIDE_UPPER_LOWER isn't set; at least not with Asserts turned on. > (Considering that even my favorite dinosaur HPUX machine has got both > HAVE_WCSTOMBS and HAVE_TOWLOWER, it may well be that there *aren't* any > such machines anymore.) > > So I'm inclined to remove the two #ifdef USE_WIDE_UPPER_LOWER tests > in trgm_op.c, and just use the multibyte-aware code all the time. > A downside of this is that if there is indeed anyone out there storing > non-ASCII trigrams on a machine without USE_WIDE_UPPER_LOWER, their > indexes would break if they pg_upgrade to 9.3. OTOH their indexes would > break anyway if they rebuilt against a more modern libc, or built with > Asserts on. > > If we don't do this then we'll have to complicate the regex indexing > patch some more, since it's currently imagining that cnt_trigram() > is always the way to make storable trigrams from raw text, and this > is just wrong in the existing non-USE_WIDE_UPPER_LOWER code path >
+1 for removing #ifdef USE_WIDE_UPPER_LOWER tests. Even if it works somewhere with non-ASCII data without USE_WIDE_UPPER_LOWER then anyway it's a buggy logic with invalid results. It's also likely we can change if (pg_database_encoding_max_length() > 1) into something like if (pg_database_encoding_max_length() > 1 && bytelen != charlen) ------ With best regards, Alexander Korotkov.