Re: [Firebird-devel] ODP: RFC: Fix for issue 6915

Pavel Cisar Fri, 05 Nov 2021 04:36:51 -0700

Hi,

Dne 04. 11. 21 v 16:56 Karol Bieniaszewski napsal(a):

„The main reason why this "removal of trailing partial contraction" was


done is to achieve behavior "consistent" with search/evaluation in other

software (like text editors etc.), so (for example) STARTING WITH "C" or

LIKE "C%" will return rows starting with "C" or "CH".

This behavior itself is questionable (but more about that later)

„

I try to understand the issue but meybe simpler description is required.

When i use WIN1250 collate PXW_PLK (my Polish language)

Then engine threat all „ch” as single letter in the index?

And i cannot find looking by only „c” without „h” with index lookup?

If yes how engine decide that this is „ch” or  „c” and „h” in words?


This is one from dark, messy corners of IT like time zones...

First and foremost, Polish language AFAIK does not define "CH" asseparate letter (or use any other contractions), so this issue does notaffect PXW_PLK (or other Polish collations) at all.

Collations are primarily used to define order. Contractions andexpansions are special rules that allow to treat groups as units orunits as groups for ordering purposes.

For example, Czech "CH" letter is ordered between "H" and "I".Unfortunately, it's not defined as single letter in most character sets.Although there are charsets like KOI-8 CS2 that have it, it's notdefined in ones that are really used (i.e. pushed by dominant playerslike Windows OS). Czech IT settled on use of WIN1250 or ISO-8859-2 thatdoes not define "CH" as single letter, so proper order of "CH" dependson collation "contraction" rule.

Collation is used to produce "sortkeys", which is basicallytransformation of given key to binary sequence that could be compareddirectly (bytewise) with other sortkeys. In RDBMS, the sortkey is usedfor all sort operations, and for index keys (as they also rely oncomparison as it's binary tree). The same apply for comparisonpredicates (i.e. col >= value etc.).

The problem is, that when you look up for partial key that ends with acharacter that is start of contraction, you enter somewhat gray territory.

For example: The key "HROCH" (Czech word for hippopotamus) could havesortkey: 13+27+23+14 (each letter replaced with byte representing it'sorder place in alphabet, which is simplification for this showcase, i.e.13=H 27=R 23=O 14=CH)

If you would look for STARTING WITH "HROC" (or LIKE "HROC%"), you have apartial key that would transform to sortkey: 13+27+23+4 (4=C) which willnot match "HROCH", but match "HROCENI" (roaring). This is actuallycorrect from Czech language POV, but not from IT POV when you work oncharacter level of characters defined in charset.

Firebird currently "solves" that by removing the trailing partialcontraction from sortkey, i.e. it will return sortkey for "HRO" insteadfor "HROC", so keys like "HROCH" or "HROCENI" are matched. Sure, it maymatch also keys like "HROB" (grave), but these are latter eliminatedfrom result set by final expression evaluation. Remember, that indexlookup just collect candidates that are then read and verified byexpression evaluation (necessary step due to MGA).

As you can see, this approach leads to excess I/O. The scale ofadditional I/O depends on your data, and is highly influenced bycharacter set - UNICODE/UTF8 requires more storage space, which meansthat the same number of rows could occupy much more data pages (in somecases significantly more).

However, the Czech language defines only one contraction (CH), but thereare other languages that have many such as Hungarian.

Btw, if you are interested in collations, take a look athttps://collation-charts.org/


best regards
Pavel



Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel

Re: [Firebird-devel] ODP: RFC: Fix for issue 6915

Reply via email to