Re: [Langcom] [i18n software news] Collation for Bashkir

Michael Everson Tue, 13 Jun 2017 10:21:52 -0700

HUZZAH!

> On 13 Jun 2017, at 07:09, Amir E. Aharoni <[email protected]> 
> wrote:
> 
> Hi,
> 
> Another edition of i18n software news!
> 
> Yesterday, a change was deployed in the Bashkir Wikipedia: The categories are 
> now sorted in the correct alphabetical order.
> 
> Bashkir, like many languages of the Soviet Union, uses the Cyrillic alphabet 
> with several extra characters. Without proper software support, the extra 
> letters are sorted according to their Unicode code point order, which is not 
> very useful. For example, the letter Ө is supposed to be in the middle of the 
> alphabet between О and П, but without correct collation it sorts at the end. 
> As a result, Ufa (Өфө), the capital of Bashkortostan, used to appear at the 
> very end of the "Capitals of Russian regions" category [1]; now it appears 
> correctly before П.
> 
> This could be resolved by adding the collation for this language to CLDR and 
> ICU, and I filed a ticket about this with CLDR [2]. Actually getting it added 
> and deployed is a long process, but the MediaWiki developer Brian Wolff 
> provided a good interim solution in MediaWiki code itself. The infrastructure 
> code around it is surprisingly tricky, but to simply add a new alphabet, you 
> just need to create a file like this:
> https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/collation/BashkirUppercaseCollation.php
> 
> When it is added to CLDR and ICU, this stopgap solution can be removed from 
> MediaWiki.
> 
> As far as I can see, Bashkir is the first language for which such a 
> comprehensive solution was made inside MediaWiki, and it is needed for many 
> others. I'll start looking for other languages where this is needed. My 
> process would be something like this:
> 1. Find a language that has a Wikipedia with incorrect collation.
> 2. Find the correct alphabetical order, using a grammar book or a dictionary, 
> and confirm it with editors in that language.
> 3. Submit a ticket to CLDR.
> 4. Add a file with an alphabet, like the Bashkir file above, to MediaWiki 
> core.
> 5. Get it reviewed, merged, and deployed.
> 6. Deploy the change to the projects in that language.
> 7. Run a script that converts the categories to the new collation.
> 
> (Steps 5 and 6 sound repetitive because the collation needs to be explicitly 
> enabled for each wiki. I filed another bug [4], which suggests defining a 
> default collation per language, so that step 6 won't be needed.)
> 
> If anybody has better suggestions about working with CLDR and ICU and getting 
> them to add and release these collation files faster, I'll be very happy to 
> hear them.
> 
> [1] http://bit.ly/2sWLJaX
> [2] http://unicode.org/cldr/trac/ticket/10195
> [3] For the confirmation about Bashkir see 
> https://phabricator.wikimedia.org/T162823 .
> [4] https://phabricator.wikimedia.org/T164985
> 
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
> _______________________________________________
> Langcom mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/langcom
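
The difference between code point order and alphabet order described in the
quoted message can be sketched in a few lines of Python. This is only an
illustration, not the MediaWiki implementation: the alphabet string below is
an assumption (the Russian base letters with Ө placed between О and П, as the
message describes, not the full Bashkir alphabet), and the names ALPHABET,
RANK and sort_key, as well as the sample words, are hypothetical.

```python
# Illustrative alphabet only: Russian base letters plus Ө between О and П.
# This is NOT the full Bashkir alphabet; it is an assumption for the sketch.
ALPHABET = "АБВГДЕЁЖЗИЙКЛМНОӨПРСТУФХЦЧШЩЪЫЬЭЮЯ"

# Position of each letter in the intended alphabetical order.
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def sort_key(word):
    """Build a sort key from alphabet positions instead of code points.

    Letters are uppercased first, so case does not affect the order.
    Characters missing from ALPHABET sort after all known letters,
    ordered among themselves by code point.
    """
    return [
        (0, RANK[ch]) if ch in RANK else (1, ord(ch))
        for ch in word.upper()
    ]


# Sample words (hypothetical test data).
words = ["Пермь", "Өфө", "Омск", "Якутск"]

# Default sorting compares code points, so Ө (U+04E8) lands after Я (U+042F).
by_code_point = sorted(words)

# Sorting with the custom key places Ө between О and П.
by_alphabet = sorted(words, key=sort_key)
```

With the default comparison, Өфө ends up last; with the custom key it falls
between Омск and Пермь. A real collation also has to handle digits,
punctuation, and first-letter headings, which this sketch ignores.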


