Hi Binarus, On 03/11/2016 09:25 PM, Binarus wrote: > On 11.03.2016 15:45, Alexander Barkov wrote: >> FYI, I have added a new task for this: >> >> https://jira.mariadb.org/browse/MDEV-9711 >> > > Alexander, not sure if my question makes any sense, but what about the _cs > collations? Like latin1_general_cs, latin2_czech_cs? Don't we like to have > these, too?
We do, eventually. I just wanted to split the task into these steps: 1. Default and _bin collations. This is what MDEV-9711 is about. 2. Unicode Collation Algorithm (UCA) based collations (e.g. utf8_unicode_nopad_ci) 3. The rest (not covered by the previous steps) This will include latin1_general_nopad_cs, latin2_czech_nopad_cs This is why asked in the previous letter if utf8_unicode_nopad_ci would work for you. Also, which MariaDB version do you need this for? Will 10.1 work? > > Regards, > > Binarus > > >> >> On 03/11/2016 03:20 PM, Alexander Barkov wrote: >>> Hello Binarus, Kristian, >>> >>> On 03/11/2016 02:25 PM, Kristian Nielsen wrote: >>>> Binarus <li...@binarus.de> writes: >>>> >>>>> "All MySQL collations are of type PADSPACE. This means that all CHAR, >>>>> VARCHAR, and TEXT values in MySQL are compared without regard to any >>>>> trailing spaces. “Comparison” in this context does not include the >>>> >>>> Yes, I have always found this terminally stupid as well. But I think it >>>> comes from the SQL standard. >>> >>> Probably this comes from here: >>> >>>> 2) A <standard character set name> specifies the name of a character set >>>> that is defined by a national or >>>> international standard. The character repertoire of CS is defined by the >>>> standard defining the character set >>>> identified by that <standard character set name>. The default collation of >>>> the character set is defined by >>>> the order of the characters in the standard and has the PAD SPACE >>>> characteristic. >>> >>> So we follow the PAD SPACE requirement for default collations here. >>> >>> I guess the reasoning here is to treat CHAR and VARCHAR in a similar way >>> by default. I agree with the standard here. We don't want different >>> behavior for CHAR and VARCHAR. >>> >>> >>> Btw, we don't follow the requirement that the default collation must be >>> defined according to "the order of the characters in the standard". >>> Default collations are traditionally case insensitive in MariaDB/MySQL. >>> >>>> >>>> The only workaround I know of is to use VARBINARY instead of VARCHAR. I >>>> think it works much the same in most respects. But obviously some semantics >>>> is lost when the server no longer is aware of the character set used. >>> >>> Correct. >>> >>>> >>>>> Since the index behaviour obviously depends on the collation, would >>>>> building an own collation which does not PADSPACE be an option? I have >>>> >>>> That would be interesting, actually. I don't know what support there is for >>>> non-PADSPACE collations. Maybe bar knows (Cc:'ed)? >>> >>> We don't have NO PAD collations yet. >>> I don't remember that anybody ever asked for this before. >>> >>> From a glance, this should not be too hard to implement. >>> >>> You previously wrote you need utf8. >>> So there are three options to make a new collation, >>> based on one of these existing collations: >>> >>> - utf8_general_ci >>> - utf8_unicode_ci >>> - utf8_bin >>> >>> but with NO PAD characteristics. >>> >>> Suppose you need utf8_general_nopad_ci >>> (i.e. based on utf8_general_ci) >>> >>> >>> utf8_general_ci is implemented in strings/ctype-utf8.c >>> >>> A new collation handler should be added, similar to this one: >>> >>> static MY_COLLATION_HANDLER my_collation_utf8_general_ci_handler = >>> { >>> NULL, /* init */ >>> my_strnncoll_utf8_general_ci, >>> my_strnncollsp_utf8_general_ci, >>> my_strnxfrm_unicode, >>> my_strnxfrmlen_unicode, >>> my_like_range_mb, >>> my_wildcmp_utf8, >>> my_strcasecmp_utf8, >>> my_instr_mb, >>> my_hash_sort_utf8, >>> my_propagate_complex >>> }; >>> >>> >>> but these three virtual functions must be redefined to new >>> similar functions that do not ignore trailing spaces: >>> >>> - my_strnncollsp_utf8_general_ci - this is used for BTREE indexes >>> - my_hash_sort_utf8 - this is used for HASH indexes >>> - my_strnxfrm_unicode - this is used for filesort >>> (non-indexed ORDER BY) >>> >>> All other functions can be reused from the existing PAD SPACE collation. >>> >>> >>> >>> Then a new "struct charset_info_st" should be defined, >>> similar to my_charset_utf8_general_ci. >>> >>> Sounds like a few hours of work. >>> >>> >>> But then thorough testing will be needed, >>> which will be the most time consuming part. >>> >>> >>> Also, for consistency, it's worthy to implement >>> at least utf8_nopad_bin and utf8_unicode_nopad_ci at once, >>> and then eventually NO PAD collations for all other character sets. >>> >>>> >>>> - Kristian. >>>> > > _______________________________________________ > Mailing list: https://launchpad.net/~maria-discuss > Post to : maria-discuss@lists.launchpad.net > Unsubscribe : https://launchpad.net/~maria-discuss > More help : https://help.launchpad.net/ListHelp > _______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp