A notable reduction in libxul size (544 KB on aarch64 Android) will be
observable in the next Nightly if all goes well. I'm sending this so that
people don't need to investigate where the libxul size change comes from.

The change comes from https://bugzilla.mozilla.org/show_bug.cgi?id=1793749
and https://bugzilla.mozilla.org/show_bug.cgi?id=1630920, both of which
implement behaviors that Chrome already has. These changes also reflect the
(upcoming) ICU4X defaults.

(TL;DR ends here; long version follows)

Bug 1630920 removes the zh-u-co-big5han and zh-u-co-gb2312 collations for
consistency with Chrome. Chrome excludes these with the comment: "big5han
and gb2312han collation do not make any sense and nobody uses them."
Therefore, Web authors cannot rely on them being present cross-browser
anyway.
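
A minimal sketch of how a page could check whether an engine honors a
requested collation, assuming the standard Intl.Collator API. The helper
name supportsCollation is hypothetical, and the check assumes the engine
reports the collation under the same name that was requested (an alias
could be reported under a different name):

```typescript
// Hypothetical helper: true if the engine resolves the requested
// collation type instead of falling back to "default".
function supportsCollation(locale: string, collation: string): boolean {
  // Build a tag such as "zh-u-co-big5han".
  const tag = locale + "-u-co-" + collation;
  const resolved = new Intl.Collator(tag).resolvedOptions();
  // Unsupported collation types resolve to "default".
  return resolved.collation === collation;
}

// Example probe; the results depend on the engine and its ICU data.
const probed = ["big5han", "gb2312", "pinyin", "stroke"];
for (const name of probed) {
  console.log("zh-u-co-" + name, supportsCollation("zh", name));
}
```

Comparing the resolved collation with the requested one avoids relying on
any particular sort result, which could coincide between collations for a
small sample of strings.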

From source code archeology, I infer that ICU's first Traditional Chinese
and Simplified Chinese collations were created in the same manner as the
Japanese collation: by following the order of the legacy coded character
sets. When more appropriate default collations were introduced (by stroke
count for Traditional Chinese and by Pinyin for Simplified Chinese), the
collations that were already in ICU got renamed instead of
getting removed. There is now an issue open to remove the legacy coded
character set-based ones from CLDR:
https://unicode-org.atlassian.net/browse/CLDR-16062

Bug 1793749 changes the _root_ collation to use implicit rather than
explicit ordering for Han characters. This change is in principle a
reduction in correctness, but Chrome is already shipping this reduction in
correctness, so Web authors cannot rely on the more-correct-in-principle
behavior across browsers anyway.

Copypaste from the bug:

ICU supports two variants of the root collation: unihan and implicithan.

unihan puts all Han characters across blocks of different ages into unified
radical-stroke order, which is theoretically proper but involves explicit
data.

implicithan explicitly orders the blocks (main ideograph block before
Extension A, even though Extension A comes first in code point order) and
then within each block implies the order from the codepoint (radical-stroke
within each block), which is OK enough in practice and involves less data.
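
As a toy illustration of the implicithan idea described above (explicit
block order first, then code point order within a block), here is a small
sketch. It is not how ICU computes collation weights; the names blockRank
and implicitHanCompare are hypothetical, and only the main ideograph block
and Extension A are modeled:

```typescript
// Block ranges per the Unicode code charts.
const MAIN_BLOCK = { start: 0x4e00, end: 0x9fff };  // CJK Unified Ideographs
const EXTENSION_A = { start: 0x3400, end: 0x4dbf }; // Extension A

// Explicit block order: main block first, then Extension A.
function blockRank(codePoint: number): number {
  if (codePoint >= MAIN_BLOCK.start && codePoint <= MAIN_BLOCK.end) return 0;
  if (codePoint >= EXTENSION_A.start && codePoint <= EXTENSION_A.end) return 1;
  return 2; // Everything else is lumped together in this toy model.
}

// Compare two code points: block rank first, then code point order,
// which within a block follows radical-stroke order.
function implicitHanCompare(a: number, b: number): number {
  const byBlock = blockRank(a) - blockRank(b);
  return byBlock !== 0 ? byBlock : a - b;
}

// U+3400 precedes U+4E00 in code point order, but sorts after it here.
const sortedHan = [0x3400, 0x4e8c, 0x4e00].sort(implicitHanCompare);
```

The only data needed is the list of block boundaries, which is why this
style of ordering involves less data than an explicit per-character order.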

The reason why implicithan is OK enough in practice is two-fold:

   1. None of the CJK locales use the root order for common characters in
   the respective languages. They all use tailorings, so unihan vs.
   implicithan is relevant only for the purpose of giving *some* order to
   characters that are so rare that the language-specific tailoring doesn't
   cover them.
   2. Since each block, including the main ideographic block, is internally
   ordered by radical-stroke, the difference is irrelevant to comparison of
   characters that are common enough to be covered by the main ideographic
   block.

- -

P.S. Chrome also excludes zh-u-co-unihan without excluding ja-u-co-unihan
or ko-u-co-unihan. I have not aligned Firefox on this, because I'm not
completely convinced about what's appropriate. So far, however, I am
unaware of any app using *-u-co-unihan collation orders for anything other
than building human-browsable lookup indexes for dictionaries. (As opposed
to sorting search results.) The Web Platform does not currently provide an
API for generating a bucketed index (for English, you'd have a bucket for
each letter from A to Z; for *-u-co-unihan, you'd have a bucket for each
radical), and it's unclear if *-u-co-unihan index generation even works
properly with the implicithan root due to the way a couple of the bucket
reference points attach to characters from outside the main ideographic
block.
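
To make "bucket reference points" concrete, here is a minimal sketch of a
bucketed index for English built on top of Intl.Collator. The function
name buildBucketedIndex and the choice of labels are hypothetical; each
label acts as the lower bound of its bucket, and items that sort before
the first label are dropped for brevity:

```typescript
// Hypothetical sketch: put each item into the last bucket whose label
// sorts at or before the item under the given locale's collation.
function buildBucketedIndex(
  items: string[],
  labels: string[],
  locale: string = "en"
): { label: string; entries: string[] }[] {
  const collator = new Intl.Collator(locale, { sensitivity: "base" });
  const compare = (a: string, b: string): number => collator.compare(a, b);
  const buckets = labels
    .slice()
    .sort(compare)
    .map((label) => ({ label: label, entries: [] as string[] }));
  for (const item of items.slice().sort(compare)) {
    let target = -1;
    for (let i = 0; i < buckets.length; i++) {
      if (compare(buckets[i].label, item) <= 0) target = i;
      else break; // Labels are sorted, so no later label can match.
    }
    if (target >= 0) buckets[target].entries.push(item);
  }
  return buckets;
}

const sampleIndex = buildBucketedIndex(
  ["cherry", "apple", "Banana", "avocado"],
  ["A", "B", "C"]
);
```

For a radical-based index, the labels would be reference characters for
each radical instead of letters, so the result depends on where those
reference characters sort relative to the items.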

If you are curious about probing a given browser, you can use
https://hsivonen.com/test/moz/zh-collations.html ("cjk" would be a more
proper name than "zh", but by the time I added Japanese and Korean tests,
there were already links to that URL out there).

-- 
Henri Sivonen
[email protected]
