The exact mechanics didn't end up matching the path that was contemplated,
but in terms of what can be observed by querying the JavaScript APIs, the
unshipping is now on Nightly.

On Mon, Apr 3, 2023 at 6:00 PM Henri Sivonen <[email protected]> wrote:

> Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1739983
>
> (See my previous email about how Intl.Collator(..., {usage: "search"}) is
> little used and limited in terms of use cases due to doing only full-string
> matching, which mainly enables filtering a set of strings by a search key
> rather than finding the search key as a substring or prefix of a longer
> string.)
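>
> As a hedged sketch (the locale and names below are illustrative, not
> taken from any real use), this is the whole extent of what the Web API
> offers for usage: "search" — whole-string comparison, suitable for
> filtering a set of strings but not for substring or prefix search:

```javascript
// The only operation usage: "search" exposes is whole-string comparison
// via compare(); the locale and names below are made up for illustration.
const search = new Intl.Collator("en", { usage: "search", sensitivity: "base" });

// Filtering a set of strings by a search key works:
const names = ["Ångström", "Muller", "Müller"];
const matches = names.filter((name) => search.compare(name, "muller") === 0);
// matches is ["Muller", "Müller"]

// But there is no API for finding the key as a substring or prefix of a
// longer string such as "Dr. Müller".
```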
>
> For somewhat unfortunate reasons that are my fault, ICU4X doesn't properly
> support search collation tailoring for Hangul: In ICU4X search collations,
> Hangul uses sorting data, which means that to match a Hangul string, the
> search key has to be canonically equivalent. Hangul is special, because
> collation builds on normalization and Hangul is special in normalization.
> (FWIW, this is a looser condition than what Firefox's ctrl-f/cmd-f imposes!
> Firefox's ctrl-f/cmd-f requires the normalization to match as well. That
> is, since input methods produce Hangul in Normalization Form C, Firefox
> requires the page to be in NFC, too, for ctrl-f/cmd-f to work.)
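>
> To make the looser/stricter distinction concrete, here's an illustrative
> sketch (the locale choice is mine): a precomposed Hangul syllable and its
> decomposed form are canonically equivalent, so a collator matches them,
> whereas a comparison that also requires the normalization form to match,
> as Firefox's ctrl-f/cmd-f effectively does, does not:

```javascript
// U+D55C (precomposed) vs. its NFD form <U+1112 U+1161 U+11AB>: the two
// are canonically equivalent but consist of different code points.
const nfc = "한";
const nfd = nfc.normalize("NFD");

console.log(nfc === nfd); // false: the code points differ, so a matcher
                          // that requires identical normalization misses it

const search = new Intl.Collator("ko", { usage: "search" });
console.log(search.compare(nfc, nfd)); // 0: collation builds on
                                       // normalization, so canonically
                                       // equivalent strings match
```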
>
> I'd like to unship Korean search collations in the ICU4C context to see if
> implementing support for them in the ICU4X context needs to be treated as a
> blocker for Gecko switching its Intl.Collator back end from ICU4C to ICU4X.
> Given the Web API surface available (full-string matching only, no prefix
> or substring matching), the utility of the Korean search collations seems
> questionable to me, which is why it doesn't seem like a good use of
> engineering effort to make ICU4X support them if not supporting them turns
> out to be feasible.
>
> Additionally, this reduces libxul size on aarch64 Android by 200 KB on top
> of the 152 KB reduction from the change contemplated in my previous email.
> That is, a 352 KB reduction taken together.
>
> The patch is easy to write. If others think this is acceptable to do, I
> intend to pursue getting this change landed in August.
>
>
> # Details
>
> ## What CLDR has
>
> In CLDR, there are three search-specific special behaviors for Hangul:
>
> 1. There exists a Korean-specific special search mode called searchjl. It
> matches on the lead consonant of each syllable and ignores the vowel and
> the possible trailing consonant of each syllable. As I understand it, this
> mode originates from contact name search on pre-iOS/Android phones.
>
> 2. The plain search mode when the Korean language is requested allows
> matching archaic Hangul with an ill-formed approximation written with a
> modern-Hangul-only input method.
>
> 3. The search root contains data analogous to the previous item, but for
> modern Hangul only: it allows well-formed modern Hangul to be matched by
> ill-formed input in which double letters have been typed by pressing the
> corresponding key twice without the shift key, instead of being typed
> normally by pressing the key once with the shift key held down.
>
> ## Why they seem questionable
>
> ### Item 1, searchjl
>
> For the use case of quickly filtering an address book view, it makes sense
> to try matching the needle as a _prefix_ of each name in the address book.
> However, the Web API only supports full-string matching, which makes the
> use case implausible in the context of the available API.
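>
> As a quick illustrative sketch (the name string is made up), the
> full-string-only API can't express the prefix matching that the
> address-book use case would need:

```javascript
// With usage: "search", compare() only answers "do these whole strings
// match?"; a needle that is merely a prefix of a name never matches.
// The name "김철수" is a made-up example.
const search = new Intl.Collator("ko", { usage: "search" });
console.log(search.compare("김철수", "김") === 0);     // false: a prefix isn't enough
console.log(search.compare("김철수", "김철수") === 0); // true: only a full match works
```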
>
> Furthermore, there were _zero_ uses of searchjl in the HTTP Archive data
> set.
>
> However, it's unclear how well HTTP Archive covers the Korean-language
> part of the Web. Also, one would expect address book filtering to sit
> behind a login, and HTTP Archive crawls the public Web. That said, many
> login-requiring sites serve their JavaScript bundle already on the login
> page, so HTTP Archive does pick up JavaScript code that activates only
> behind a login.
>
> ### Item 2, matching archaic Hangul with an ill-formed approximation
> written with a modern-only input method when the Korean language is
> requested
>
> As "archaic" suggests, archaic Hangul is not used for present-day Korean
> text and is of relevance to scholarly use. For it to be relevant in the
> context of the Web API that only allows for full-string matching, there
> would need to be a set of archaic Hangul strings to be filtered by a search
> key typed with a modern-Hangul-only input method. This is less plausible as
> a use case than being able to ctrl-f/cmd-f over a digitized historical
> document in apps that use the data for substring search (i.e. Chrome and
> Safari UI but _not_ the Web API), which I understand to motivate the
> existence of the data in CLDR.
>
> Arguably, it's a layering violation for the search data to address input
> method concerns like this, but then one might argue that
> diacritic-insensitive search for the Latin script is about addressing an
> input method concern, too. AFAICT, Windows comes with an input method for
> archaic Hangul but other popular operating systems do not.
>
> ### Item 3, matching well-formed modern Hangul with an ill-formed
> approximation typed without the shift key in the context of all languages
>
> This feature makes no sense to me. I can't infer a legitimate use case,
> and no one has told me one when I've asked. My inference is that this data
> exists in the first place for completeness, so that the way of
> approximating archaic Hangul also works for syllables that remain in
> modern Hangul. The archaic Hangul data is logically script-level data, but
> it is placed in a language-specific tailoring, which does not make sense
> as a matter of principle. My inference is that putting all the Hangul data
> in the search root would have made the other search collations, each of
> which contains a _copy_ of the search root, too large. As a compromise,
> the principle of putting script-level things in the search root was
> violated: the modern Hangul data was left in the search root and the
> archaic Hangul data was pushed to the Korean tailoring, even though the
> data left in the search root is just weird on its own. (I haven't been
> able to get access to the minutes of the meeting from over a decade ago
> where the split was decided.)
>
> If my inference is incorrect and/or if there is a legitimate
> modern-Hangul-related use case for the Hangul data in the search root,
> please let me know.
>
>
> ## Alternatives
>
> An approximation of searchjl would fit ICU4X, so an alternative would be
> to modify searchjl instead of removing it. Specifically, by omitting the
> lines
> https://github.com/unicode-org/icu/blob/64b35481263ac4df37a28a9c549553ecc9710db2/icu4c/source/data/coll/ko.txt#L369-L379
> searchjl would fit in the ICU4X data format. It's unclear to me what the
> user-visible purpose of those lines is. (If I'm reading those lines
> correctly, the patterns don't occur in well-formed modern Hangul text and
> don't occur when typing a consonant-only sequence using an IME.)
>
>
> ## Standards
>
> I've been told that customizing the collation data is permitted by
> relevant standards. An anything-goes position in standards isn't
> particularly interesting for the purpose of deciding whether it's a good
> idea to do this, though.
>
>
> ## Test suites
>
> Test suites that I'm aware of don't test this and, therefore, wouldn't
> fail.
>
>
> ## Other browsers
>
> I don't expect Chrome or Safari to do this. I gather that the capability
> to search archaic Hangul with a modern-only input method comes from work on
> Chrome's ctrl-f/cmd-f feature from the time before the Blink fork from
> WebKit. Furthermore, WebKit on Apple operating systems uses the system copy
> of ICU4C. (Firefox's ctrl-f/cmd-f isn't collator-based and doesn't allow
> for searching archaic Hangul with a modern-only input method.)
>
>
> ## Platforms
>
> All.
>
> --
> Henri Sivonen
> [email protected]
>


-- 
Henri Sivonen
[email protected]
