On 7/2/24 13:43, Ayesh Karunaratne wrote:

> Hi Tim,
> Now that the RFC is restarted, could you mention some examples in Georgian that might be good test cases?
>
> I was thinking there might be some good test cases in Turkish, but couldn't find any. The RFC has examples (https://github.com/php/php-src/pull/13161) in Vietnamese, but they are correct for both "uppercase first character" and titlecase conversions.

Any Georgian word would do. Your ASCII test case is "abc". The Georgian equivalent for that would be "აბგ" (ani bani gani, U+10D0 U+10D1 U+10D2) which should remain the same after passing through mb_ucfirst(). Compare mb_strtoupper("აბგ") -> "ᲐᲑᲒ" (U+1C90 U+1C91 U+1C92).

On the task I mentioned that ligatures are also affected. I gave the example mb_ucfirst("ǉ") -> "ǈ", that is, U+01C9 -> U+01C8. You could add a test case for that. Compare mb_strtoupper("ǉ") -> "Ǉ" (U+01C7).

To repeat my rationale -- we can view ucfirst() either through a technical lens (convert the first character of a string to upper case) or through a natural language lens (convert a string to sentence case, with the initial letter capitalised per local conventions). I am arguing to make mb_ucfirst() be a natural language extension of ucfirst(), because applying the technical extension would produce results that look quite jarring in a natural language context.
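
As a rough illustration of the two lenses, here is a minimal sketch that applies each to the first code point only. The helper names titlecase_first() and uppercase_first() are made up for this example and are not part of the RFC or of mbstring. The sketch assumes an mbstring build with full case mapping and Unicode 11 or later data (PHP 7.3+), and it is not meant to reflect how mb_ucfirst() itself is implemented.

```php
<?php
// Hypothetical helper (natural language lens): title-case the first
// code point and leave the rest of the string untouched.
function titlecase_first(string $s): string
{
    if ($s === '') {
        return $s;
    }
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');
    return mb_convert_case($first, MB_CASE_TITLE, 'UTF-8') . $rest;
}

// Hypothetical helper (technical lens): upper-case the first code point.
function uppercase_first(string $s): string
{
    if ($s === '') {
        return $s;
    }
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

$cases = [
    'abc',                       // ASCII baseline
    "\u{10D0}\u{10D1}\u{10D2}",  // Georgian ani bani gani
    "\u{01C9}",                  // lj ligature
];

// Print both conversions side by side so the differences are visible.
foreach ($cases as $s) {
    printf("%s  title: %s  upper: %s\n", $s, titlecase_first($s), uppercase_first($s));
}
```

If the case tables behave as described above, the Georgian word should be unchanged by titlecase_first() while uppercase_first() turns its first letter into Mtavruli, and the ligature should become U+01C8 and U+01C7 respectively.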

There are some edge cases which are not quite right. To really do a good job, a new case map will be needed. But if we document it as being for natural language, and set the right expectations, we can fix the edge cases later.

-- Tim Starling

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
