Hi,
Dne 04. 11. 21 v 16:56 Karol Bieniaszewski napsal(a):
„The main reason why this "removal of trailing partial contraction" was
done is to achieve behavior "consistent" with search/evaluation in other
software (like text editors etc.), so (for example) STARTING WITH "C" or
LIKE "C%" will return rows starting with "C" or "CH".
This behavior itself is questionable (but more about that later)
„
I try to understand the issue but meybe simpler description is required.
When i use WIN1250 collate PXW_PLK (my Polish language)
Then engine threat all „ch” as single letter in the index?
And i cannot find looking by only „c” without „h” with index lookup?
If yes how engine decide that this is „ch” or „c” and „h” in words?
This is one from dark, messy corners of IT like time zones...
First and foremost, Polish language AFAIK does not define "CH" as
separate letter (or use any other contractions), so this issue does not
affect PXW_PLK (or other Polish collations) at all.
Collations are primarily used to define order. Contractions and
expansions are special rules that allow to treat groups as units or
units as groups for ordering purposes.
For example, Czech "CH" letter is ordered between "H" and "I".
Unfortunately, it's not defined as single letter in most character sets.
Although there are charsets like KOI-8 CS2 that have it, it's not
defined in ones that are really used (i.e. pushed by dominant players
like Windows OS). Czech IT settled on use of WIN1250 or ISO-8859-2 that
does not define "CH" as single letter, so proper order of "CH" depends
on collation "contraction" rule.
Collation is used to produce "sortkeys", which is basically
transformation of given key to binary sequence that could be compared
directly (bytewise) with other sortkeys. In RDBMS, the sortkey is used
for all sort operations, and for index keys (as they also rely on
comparison as it's binary tree). The same apply for comparison
predicates (i.e. col >= value etc.).
The problem is, that when you look up for partial key that ends with a
character that is start of contraction, you enter somewhat gray territory.
For example: The key "HROCH" (Czech word for hippopotamus) could have
sortkey: 13+27+23+14 (each letter replaced with byte representing it's
order place in alphabet, which is simplification for this showcase, i.e.
13=H 27=R 23=O 14=CH)
If you would look for STARTING WITH "HROC" (or LIKE "HROC%"), you have a
partial key that would transform to sortkey: 13+27+23+4 (4=C) which will
not match "HROCH", but match "HROCENI" (roaring). This is actually
correct from Czech language POV, but not from IT POV when you work on
character level of characters defined in charset.
Firebird currently "solves" that by removing the trailing partial
contraction from sortkey, i.e. it will return sortkey for "HRO" instead
for "HROC", so keys like "HROCH" or "HROCENI" are matched. Sure, it may
match also keys like "HROB" (grave), but these are latter eliminated
from result set by final expression evaluation. Remember, that index
lookup just collect candidates that are then read and verified by
expression evaluation (necessary step due to MGA).
As you can see, this approach leads to excess I/O. The scale of
additional I/O depends on your data, and is highly influenced by
character set - UNICODE/UTF8 requires more storage space, which means
that the same number of rows could occupy much more data pages (in some
cases significantly more).
However, the Czech language defines only one contraction (CH), but there
are other languages that have many such as Hungarian.
Btw, if you are interested in collations, take a look at
https://collation-charts.org/
best regards
Pavel
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel