Hi,

Dne 04. 11. 21 v 16:56 Karol Bieniaszewski napsal(a):
„The main reason why this "removal of trailing partial contraction" was

done is to achieve behavior "consistent" with search/evaluation in other

software (like text editors etc.), so (for example) STARTING WITH "C" or

LIKE "C%" will return rows starting with "C" or "CH".

This behavior itself is questionable (but more about that later)

„

I try to understand the issue but meybe simpler description is required.

When i use WIN1250 collate PXW_PLK (my Polish language)

Then engine threat all „ch” as single letter in the index?

And i cannot find looking by only „c” without „h” with index lookup?

If yes how engine decide that this is „ch” or  „c” and „h” in words?

This is one from dark, messy corners of IT like time zones...

First and foremost, Polish language AFAIK does not define "CH" as separate letter (or use any other contractions), so this issue does not affect PXW_PLK (or other Polish collations) at all.

Collations are primarily used to define order. Contractions and expansions are special rules that allow to treat groups as units or units as groups for ordering purposes.

For example, Czech "CH" letter is ordered between "H" and "I". Unfortunately, it's not defined as single letter in most character sets. Although there are charsets like KOI-8 CS2 that have it, it's not defined in ones that are really used (i.e. pushed by dominant players like Windows OS). Czech IT settled on use of WIN1250 or ISO-8859-2 that does not define "CH" as single letter, so proper order of "CH" depends on collation "contraction" rule.

Collation is used to produce "sortkeys", which is basically transformation of given key to binary sequence that could be compared directly (bytewise) with other sortkeys. In RDBMS, the sortkey is used for all sort operations, and for index keys (as they also rely on comparison as it's binary tree). The same apply for comparison predicates (i.e. col >= value etc.).

The problem is, that when you look up for partial key that ends with a character that is start of contraction, you enter somewhat gray territory.

For example: The key "HROCH" (Czech word for hippopotamus) could have sortkey: 13+27+23+14 (each letter replaced with byte representing it's order place in alphabet, which is simplification for this showcase, i.e. 13=H 27=R 23=O 14=CH)

If you would look for STARTING WITH "HROC" (or LIKE "HROC%"), you have a partial key that would transform to sortkey: 13+27+23+4 (4=C) which will not match "HROCH", but match "HROCENI" (roaring). This is actually correct from Czech language POV, but not from IT POV when you work on character level of characters defined in charset.

Firebird currently "solves" that by removing the trailing partial contraction from sortkey, i.e. it will return sortkey for "HRO" instead for "HROC", so keys like "HROCH" or "HROCENI" are matched. Sure, it may match also keys like "HROB" (grave), but these are latter eliminated from result set by final expression evaluation. Remember, that index lookup just collect candidates that are then read and verified by expression evaluation (necessary step due to MGA).

As you can see, this approach leads to excess I/O. The scale of additional I/O depends on your data, and is highly influenced by character set - UNICODE/UTF8 requires more storage space, which means that the same number of rows could occupy much more data pages (in some cases significantly more).

However, the Czech language defines only one contraction (CH), but there are other languages that have many such as Hungarian.

Btw, if you are interested in collations, take a look at https://collation-charts.org/

best regards
Pavel



Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel

Reply via email to