Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1739983

(See my previous email about how Intl.Collator(..., {usage: "search"}) is
little used and limited in terms of use cases due to doing only full-string
matching, which mainly enables filtering a set of strings by a search key
rather than finding the search key as a substring or prefix of a longer
string.)

For somewhat unfortunate reasons that are my fault, ICU4X doesn't properly
support search collation tailoring for Hangul: In ICU4X search collations,
Hangul uses sorting data, which means that to match a Hangul string, the
search key has to be canonically equivalent. Hangul is special, because
collation builds on normalization and Hangul is special in normalization.
(FWIW, this is a looser condition than what Firefox's ctrl-f/cmd-f imposes!
Firefox's ctrl-f/cmd-f requires the normalization to match as well. That
is, since input methods produce Hangul in Normalization Form C, Firefox
requires the page to be in NFC, too, for ctrl-f/cmd-f to work.)

I'd like to unship Korean search collations in the ICU4C context to see if
implementing support for them in the ICU4X context needs to be treated as a
blocker for Gecko switching its Intl.Collator back end from ICU4C to ICU4X.
Given the Web API surface available (full-string matching only, no prefix
or substring matching), the utility of the Korean search collations seems
questionable to me, which is why it doesn't seem like a good use of
engineering effort to make ICU4X support them if not supporting them turns
out to be feasible.

Additionally, this reduces libxul size on aarch64 Android by 200 KB on top
of the 152 KB reduction from the change contemplated in my previous email.
That is, a 352 KB reduction taken together.

The patch is easy to write. If others think this is acceptable to do, I
intend to pursue getting this change landed in August.


# Details

## What CLDR has

In CLDR, there are three search-specific special behaviors for Hangul:

1. There exists a Korean-specific special search mode called searchjl. It
matches on the lead consonant of each syllable and ignores the vowel and
the possible trailing consonant of each syllable. As I understand it, this
mode originates from contact name search on pre-iOS/Android phones.

2. The plain search mode when the Korean language is requested allows
matching archaic Hangul with an ill-formed approximation written with a
modern-Hangul-only input method.

3. The search root contains data analogous to the previous item, but for
modern Hangul only: it allows well-formed modern Hangul to be matched by
ill-formed input in which double letters have been typed by pressing the
corresponding key twice without the shift key instead of being typed
normally by pressing the corresponding key once with the shift key pressed.

## Why they seem questionable

### Item 1, searchjl

For the use case of quickly filtering an address book view, it makes sense
to try matching the needle as a _prefix_ of each name in the address book.
However, the Web API only supports full-string matching, which makes the
use case implausible in the context of the available API.

Furthermore, there were _zero_ uses of searchjl in the HTTP Archive data
set.

However, it's unclear how well HTTP Archive covers the Korean-language part
of the Web. Also, one would expect filtering an address book to be behind
login, and HTTP Archive crawls the public Web. That said, many
login-requiring sites serve their JavaScript bundle already on the login
page, so HTTP Archive does pick up JavaScript code that activates behind
login.

### Item 2, matching archaic Hangul with an ill-formed approximation
written with a modern-only input method when the Korean language is
requested

As "archaic" suggests, archaic Hangul is not used for present-day Korean
text and is of relevance to scholarly use. For it to be relevant in the
context of the Web API that only allows for full-string matching, there
would need to be a set of archaic Hangul strings to be filtered by a search
key typed with a modern-Hangul-only input method. This is less plausible as
a use case than being able to ctrl-f/cmd-f over a digitized historical
document in apps that use the data for substring search (i.e. Chrome and
Safari UI but _not_ the Web API), which I understand to motivate the
existence of the data in CLDR.

Arguably, it's a layering violation for the search data to address input
method concerns like this, but then one might argue that
diacritic-insensitive search for the Latin script is about addressing an
input method concern, too. AFAICT, Windows comes with an input method for
archaic Hangul but other popular operating systems do not.

### Item 3, matching well-formed modern Hangul with an ill-formed
approximation typed without the shift key in the context of all languages

This feature makes no sense to me. I can't infer a legitimate use case, and
no one has told me when I've asked. My inference is that this data exists
in the first place for completeness so that the way of approximating
archaic Hangul also works for syllables that remain in modern Hangul. The
archaic Hangul data is logically script-level data, but it is placed in a
language-specific tailoring. This does not make sense as a matter of
principle. My inference is that putting all the Hangul data in the search
root would have made the other search collations, each of which contains a
_copy_ of the search root, too large. As a compromise, the principle of
putting script-level things in the search root was violated: the modern
Hangul data was left in the search root and the archaic Hangul data was
pushed to the Korean tailoring, even though the data left in the search
root is just weird on its own. (I haven't been able to get access to the
minutes of the meeting from over a decade ago where the split was decided.)

If my inference is incorrect and/or if there is a legitimate
modern-Hangul-related use case for the Hangul data in the search root,
please let me know.


## Alternatives

An approximation of searchjl would fit ICU4X, so an alternative would be to
modify searchjl instead of removing it. Specifically, by omitting the lines
https://github.com/unicode-org/icu/blob/64b35481263ac4df37a28a9c549553ecc9710db2/icu4c/source/data/coll/ko.txt#L369-L379
searchjl would fit in the ICU4X data format. It's unclear to me what the
user-visible purpose of those lines is. (If I'm reading those lines
correctly, the patterns don't occur in well-formed modern Hangul text and
don't occur when typing a consonant-only sequence using an IME.)


## Standards

I've been told that customizing the collation data is permitted by relevant
standards. An anything-goes position in standards isn't particularly
interesting for the purpose of deciding whether it's a good idea to do
this, though.


## Test suites

Test suites that I'm aware of don't test this and, therefore, wouldn't fail.


## Other browsers

I don't expect Chrome or Safari to do this. I gather that the capability to
search archaic Hangul with a modern-only input method comes from work on
Chrome's ctrl-f/cmd-f feature from the time before the Blink fork from
WebKit. Furthermore, WebKit on Apple operating systems uses the system copy
of ICU4C. (Firefox's ctrl-f/cmd-f isn't collator-based and doesn't allow
for searching archaic Hangul with a modern-only input method.)


## Platforms

All.

-- 
Henri Sivonen
[email protected]
