Do we care enough about libxul size to take an easy-to-write patch that
would reduce libxul size by 152 KB at the cost of slightly reduced
Web-exposed correctness (in principle, but most likely not in a
user-relevant way, judging from the HTTP Archive data set)?

Intl.Collator(..., {usage: "search"}) is an API for fuzzy filtering of a
set of strings by a search key that is expected to have come from user
input. (In the context of the Latin script, "fuzzy" means case-insensitive
and diacritic-insensitive.) The API allows for full-string matching only,
not substring search. That limits it to filtering a set of strings by
testing each one against the search key, so the addressable use cases are
pretty narrow. Based on HTTP Archive, almost all instances of calling code
on the Web are traceable to a single PR:
https://github.com/mapbox/mapbox-gl-js/pull/6270 .
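
To make the shape of that use case concrete, here is a minimal sketch of
whole-string filtering with a search collator. The helper name filterByKey,
the sample data, and the explicit sensitivity: "base" option are my
assumptions for illustration; this is not taken from the Mapbox PR.

```javascript
// Hypothetical helper: keep the candidates that the collator treats as
// equal to the user-supplied key. Only whole strings are compared, so a
// candidate that merely contains the key does not match.
function filterByKey(candidates, key, locale = "en") {
  const collator = new Intl.Collator(locale, {
    usage: "search",
    sensitivity: "base", // assumption: ignore case and diacritic differences
  });
  return candidates.filter((s) => collator.compare(s, key) === 0);
}

const places = ["Zürich", "ZURICH", "Zurich Airport", "Geneva"];
console.log(filterByKey(places, "zurich"));
```

With an English collator, "Zurich Airport" is filtered out even though it
starts with the key, which is the full-string limitation described above.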

Some Latin-script languages[1] have language-specific exceptions to
diacritic-insensitivity. There are also script-level fuzziness rules for
the Arabic script (to be insensitive to certain Arabic marks) and the Thai
script (to be insensitive to phinthu/virama). (Hangul has such rules too,
but that's a different story; see my next email.) Additionally, there is a
conceptually "script-level" rule for symbols that makes the not-equals sign
_not_ match the equals sign in diacritic-insensitive matching.
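
For anyone who wants to poke at these rules, a small probe along the
following lines should work in a browser console or in Node. The helper
name matches, the choice of pairs, and the explicit sensitivity: "base"
option are my assumptions, and the printed results depend on the collation
data the engine ships, so I am not listing expected output for the
language-specific or symbol cases.

```javascript
// Hypothetical probe: does a search collator for `locale` treat a and b
// as a match at base sensitivity?
function matches(locale, a, b) {
  const collator = new Intl.Collator(locale, {
    usage: "search",
    sensitivity: "base", // assumption: case- and diacritic-insensitive
  });
  return collator.compare(a, b) === 0;
}

const pairs = [
  ["a", "ä"], // Latin base letter vs. the same letter with a diaeresis
  ["=", "≠"], // equals sign vs. not-equals sign
];

// "en" has no language-specific exceptions; "sv" (Swedish) is on the list.
for (const locale of ["en", "sv"]) {
  for (const [a, b] of pairs) {
    console.log(locale, JSON.stringify(a), JSON.stringify(b), matches(locale, a, b));
  }
}
```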

The language-specific rules and the script-level rules combine really badly
in terms of data size. What was originally supposed to be sorting data
reuse "for free" ends up growing libxul by 152 KB.

We could make libxul 152 KB smaller by not having the Arabic and
Thai-script fuzziness rules (and the not-equals _un_fuzziness rule) apply
when language-specific Latin-script rules are in effect.

If we were to do this, it would apply to all platforms (not just Android)
so as not to add build system complications.

Eventually, we could have a size reduction without a Web-exposed behavior
change by migrating to ICU4X *and* implementing
https://github.com/unicode-org/icu4x/issues/3178 on the ICU4X side.

[1]:
 * Azeri
 * Catalan
 * Danish
 * Faroese
 * Finnish
 * German
 * Greenlandic
 * Hungarian
 * Icelandic
 * Inari Sámi
 * North Sámi
 * Norwegian
 * Slovak
 * Spanish
 * Swedish
 * Turkish

Notably, English (the common fallback language) and French (used in a
number of places where Arabic is also used) are _not_ on this list and,
therefore, would still get the Arabic-script and Thai-script fuzziness.

-- 
Henri Sivonen
[email protected]
