On 02/11/2021 07:57, Pavel Cisar wrote: > Problem: > > This issue is related to intl. support for some languages, use of > collations and LIKE and STARTING WITH predicates,
And SIMILAR. And to some degree, SUBSTRING, CHAR_LENGTH, REPLACE, OVERLAY, etc. > and >= / <= operators. This is much less difficult to fix, it has no user visible implications. > Specifically, when certain values are used on columns data in certain > languages & collations, the performance may drop significantly (scale > depends on actual query and data distribution). > > Analysis revealed, that when last character of the lookup value is part > of "contraction" in used language, it's removed from the key by engine, > so the lookup is done for shortened key and thus may include much more > candidate rows - that are later eliminated from result by expression > evaluation. The removal is done in Utf16Collation::stringToKey and > LC_NARROW_string_to_key functions, and thus affects collations based on > Utf16 and NARROW, but not in KSC, UCS2, BIG5, JIS or GB2312 charsets. > > The "contraction" is a facility to handle multi-character letters You well said relation of "characters" and "letters". Firebird (as well Java, .NET, etc) operates on characters. CH (even in Czech) has two characters (in Unicode and in any character set present in Firebird). If you look at the standard it describes SUBSTRING in terms of characters. And LIKE in terms of SUBSTRINGs! Of course, the same for SIMILAR, CHAR_LENGTH, etc. > in certain languages, like Spanish or Czech "ch" that is composed from "c" > and "h". The main purpose of contractions is handling of special > ordering rules for these letters, but has more broader meaning and use. > There are many languages that use contractions (but the issue was first > reported for Czech language), and contractions may consists from more > than two characters (some contraction rules could be quite complex, for > example there are "contractions with expansions" in Hungarian language). > > The main reason why this "removal of trailing partial contraction" was > done is to achieve behavior "consistent" with search/evaluation in other > software (like text editors etc.), so (for example) STARTING WITH "C" or > LIKE "C%" will return rows starting with "C" or "CH". > Yes, "consistent". Let's talk more about it later. > This behavior itself is questionable (but more about that later), but > the main issue was the performance degradation. It's caused by increased > amount of I/O, as engine may process more candidate rows (selected by > index) than necessary. The amount of excess I/O depends greatly on data > distribution and used character set. For example, charset WIN1250 with > PDOX_CSY collation is affected (has contraction for "CH"), but WIN1250 > is NARROW charset so data are dense and stored on less pages, while > charset UTF8 with collation for LOCALE=cs_CZ uses much more space (up to > 2x-3x times) and thus more I/O for the same set of data. > > The (non-)solution to performance issue was to use collations WITHOUT > contractions, which is not acceptable by end users, as it would also > lead in wrong order of data in ORDER BY / GROUP BY that would not > respect ordering rules for multi-character letters. > > So, a better fix for performance issue is needed. Please note that > although this issue has very specific conditions to manifest, once user > is affected, the impact could be quite severe, especially for UTF8-based > data. It's also important to consider here that it's not possible to > work around or minimize the chance that this issue manifests by database > or application design. > >From some degree, it's possible in the same manner as it's very questionable when application allows user to enter a single character and display a list of all names starting with that character. For small tables, it's practical and fast. For large ones, neither. I always said the test case is artificial because if you add some more data, that pattern will manifest performance problem for all characters in all character sets. Nobody is saying there is no problem with the subject performance. > Proposed solutions: > > There are two "competing" solutions that BOTH solve the performance > problem, but have different consequences. > > 1. Solution (proposed by Vlad and me): > > Use collation sortkey also in non-indexed lookup / comparisons and get > rid of the trailing partial contraction removal code, so collation > sortkeys would be used in both cases. > > We tested patched 3.0.7 engine that does not do trailing partial > contraction removal, and can confirm that the performance problem is > gone. It's the same as test a pattern with a character not part of contraction. > It appears that only the STARTING WITH (that covers LIKE 'X%' > patterns) needs to be adapted to use sortkeys for non-indexed lookup to > have consistent results. Casual expressions as col >= 'C' works > normally. Other pattern matching expression types (LIKE patterns that > couldn't be converted to STARTING WITH, regex) work on character level > and there isn't any type of "key" involved, so using collation is not > expected nor desired. > SIMILAR TO 'constant%' is also optimized to use index. > However, there are some changes in behavior. The STARTING WITH will not > include contractable data that COULD be partially matched by lookup key, > i.e. STARTING WITH 'C' will not match string starting with 'CH'. From > language sense this is the CORRECT behavior, that is unfortunately not > supported by ALL software. Users are just used that there are some cases > when they get apples + oranges while asking for apples, because it's how > it works now. I'm not YET aware of any other case that would yield > different results. There could be some strange collation with > contractions that would sort contracted unit before it's first letter, > so output from query with condition like col >= 'X' would return > different set, but I'm not aware of any. > You talked about "consistency" before. The solution you propose is not SQL compliant (see above). LIKE is very related to STARTING WITH. SIMILAR TO is very related to LIKE. All of this is very related to SUBSTRING and CHAR_LENGTH. So while SQL compliance being the least important thing here, you cannot advocate for this solution talking about consistency. This solution (if not fully applied to all operations) is full of inconsistencies. > Although the new behavior is not supported by all software, You mean, most software do not do this. > it IS CONSISTENT with how MS SQLServer and Oracle handle these cases (which we > verified). This fact was highlighted by asked users (that often build > applications supporting multiple RDBMS) as another reason (beside that > it's how it has to work in first place) why this solution should be the > preferred one. > > IF we will implement this solution, we will introduce configuration > option to switch between new and old behavior. If this will be done in > point releases, it should be *disabled* (old way) by default and enabled > explicitly via config option. v5 may have this enabled unconditionally. > This solution also breaks NUMERIC-SORT collations. '1' and '100' both starts with '1'. I guess you do not want to change that. This solution breaks it. > 2. Solution (proposed by Adriano): > > Replace the trailing partial contraction removal code with code that > would make multiple lookups for all parts of > possible contraction, for example STARTING WITH "C" will perform lookup > for "C" and "CH". > > Experiments show that this should also solve the performance problem, > while keeping the current behavior. However, the current behavior is not > exactly what asked affected users want. > > Here is the opinion from Dmitry Yemanov: > > I'm not really happy with comparing only to Oracle/MSSQL. MySQL seems to > be in the opposite camp and agrees on 'ch' like 'c%' is true in their > utf_czech_ci collation. PostgreSQL is somewhat difficult to compare > against, as AFAIR it still relies on the system locales and does not > have any built-in collations. > PostgreSQL has unicode (ICU) collations and works like Firebird. It consistently searches indexed and non-indexed data. I do not know how they do it (multiple lookups, sort key contraction, other solution). > And regardless, I give more priority to the "natural language" argument > than to the "compatibility" one. However, it's good when they match each > other. > > For me two things are obvious: > > 1) Regular comparisons (as well as sorting) must use all collation > rules, including contractions > > 2) There should be no difference (except performance) between indexed > and non-indexed access > This has being said by all commenting. In earlier Firebird (and Interbase) indexed and non-indexed lookups with 'C' was returning different results. And it was considered a no-go way to revert to this behavior because it is inconsistent. So consistency is important or not? > But the rest is really complicated for me. We had INTL support for kinda > everything, but SIMILAR was implemented later and uses a different > rules. Maybe in the ideal world it could also take collations into > account, but we're surely not in position to do that ourselves, and > given that no other DBMS seem to have collation-aware SIMILAR, let's > just consider it a fact we live with. > SIMILAR indeed does not use collation. But it emulates things via string transformation. If a charset/collation does funky things, it may be inconsistent. Nobody so far reported any problem about this. > LIKE and STARTING WITH are in the middle between comparisons/sorts and > SIMILAR though. If we treat LIKE being a close friend for SIMILAR and > STARTING WITH being a shorthand for LIKE, we may end with both ignoring > the collation rules. If we treat STARTING WITH as a custom shorthand for > greater-or-less comparisons, then it should be collation-aware. And breaks NUMERIC-SORT. Adriano Firebird-Devel mailing list, web interface at https://lists.sourceforge.net/lists/listinfo/firebird-devel