On 7/2/24 13:43, Ayesh Karunaratne wrote:
> Hi Tim,
>
> Now that the RFC is restarted, could you mention some examples in
> Georgian that might be good test cases?
>
> I was thinking there might be some good test cases in Turkish, but
> couldn't find any. The RFC has examples
> (https://github.com/php/php-src/pull/13161) in Vietnamese, but they
> are correct for both "uppercase first character" and titlecase
> conversions.

Any Georgian word would do. Your ASCII test case is "abc". The
Georgian equivalent for that would be "აბგ" (ani bani gani, U+10D0
U+10D1 U+10D2) which should remain the same after passing through
mb_ucfirst(). Compare mb_strtoupper("აბგ") -> "ᲐᲑᲒ" (U+1C90 U+1C91
U+1C92).
On the task I mentioned that ligatures are also affected. I gave the
example mb_ucfirst("ǉ") -> "ǈ", that is, U+01C9 -> U+01C8. You could
add a test case for that. Compare mb_strtoupper("ǉ") -> "Ǉ" (U+01C7).
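
For anyone who wants to check these mappings outside PHP, here is a
minimal sketch in Python. It only illustrates the Unicode uppercase and
titlecase mappings for the two examples above; it is not PHP's mbstring
implementation. The helper names ucfirst_upper and ucfirst_title are
made up for this illustration, and it assumes a Python built against
Unicode 11 or later (3.7+), since that is when the Mtavruli uppercase
mappings were added.

```python
def ucfirst_upper(s: str) -> str:
    """'Technical' reading: apply the uppercase mapping to the first character."""
    return s[:1].upper() + s[1:]


def ucfirst_title(s: str) -> str:
    """'Natural language' reading: apply the titlecase mapping to the first character."""
    return s[:1].title() + s[1:]


def codepoints(s: str) -> str:
    """Render a string as U+XXXX code points so the difference is visible."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


if __name__ == "__main__":
    georgian = "\u10d0\u10d1\u10d2"  # ani bani gani (Mkhedruli)
    digraph = "\u01c9"               # LATIN SMALL LETTER LJ

    for label, text in (("Georgian", georgian), ("lj digraph", digraph)):
        print(label)
        print("  input:    ", codepoints(text))
        print("  uppercase:", codepoints(ucfirst_upper(text)))
        print("  titlecase:", codepoints(ucfirst_title(text)))
```

With the titlecase mapping the Georgian string comes back unchanged and
U+01C9 becomes U+01C8; with the uppercase mapping the first Georgian
letter becomes U+1C90 and U+01C9 becomes U+01C7.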
To repeat my rationale -- we can view ucfirst() either through a
technical lens (convert the first character of a string to upper case)
or through a natural language lens (convert a string to sentence case,
with the initial letter capitalised per local conventions). I am
arguing to make mb_ucfirst() be a natural language extension of
ucfirst(), because applying the technical extension would produce
results that look quite jarring in a natural language context.
There are some edge cases which are not quite right. To really do a
good job, a new case map will be needed. But if we document it as
being for natural language, and set the right expectations, we can fix
the edge cases later.
-- Tim Starling
--
PHP Internals - PHP Runtime Development Mailing List