Re: [PHP-DEV] [Discussion] Scalar Object Strings and MultibyteEncodings

2019-06-22 Thread Rowan Collins

On 20/06/2019 23:30, Mark Randall wrote:
> There does at least seem to be the starting point in that mbstring is
> already widely used, and my suggestion that it "work as expected" is
> more that it would work as the equivalent mbstring / iconv function
> would.



I think this is a rather short-sighted way of looking at it. If people 
want the API provided by the mbstring extension, they can just use those 
functions; the advantage of designing a new set of functions is surely 
that we don't need to stick to past decisions. If we start to build a 
new standard library, as Zeev suggested in the deprecation thread, it is 
a once-in-a-lifetime chance to build something better, not just copy 
what's gone before.



> mb_strlen returns the number of code points, for example. I'm not
> immediately seeing anything about mbstring supporting graphemes, as
> the only reference I could find to their manipulation was the intl
> extension.



The mbstring extension was not built for Unicode, but for older Japanese 
multi-byte encodings, where the definition of "character" is much more 
straightforward. Its Unicode support seems to mostly see code points as 
mappings for characters in some other encoding. (The oldest manual page 
for it on archive.org [1] is from 2001, and includes the quaint remark 
"As Unicode is getting popular, UTF-8 is used also.") The iconv library 
is even more explicitly aimed at converting between character sets, 
rather than understanding them (the extra functions such as iconv_strlen 
are unique to PHP).


Unicode today is much more than a mapping of legacy encodings to a 
universal character set, and I can think of no useful purpose in 
declaring the "string length" of the British flag emoji to be 2, just 
because it is encoded as the sequence U+1F1EC U+1F1E7.
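
As a rough illustration of the point about the flag emoji, the sketch
below measures the same string in bytes, code points, and grapheme
clusters. The helper name string_lengths() is hypothetical, and the code
assumes the string is UTF-8; the mbstring and intl extensions are
optional here, and a measure is reported as null when its extension is
not loaded.

```php
<?php
/**
 * Measure one UTF-8 string three ways (hypothetical helper).
 * A measure is null when the extension providing it is not loaded.
 */
function string_lengths(string $s): array
{
    return [
        // Raw byte count, what strlen() has always returned.
        'bytes'       => strlen($s),
        // Code points, as counted by the mbstring extension.
        'code_points' => function_exists('mb_strlen') ? mb_strlen($s, 'UTF-8') : null,
        // Grapheme clusters, as counted by the intl extension.
        'graphemes'   => function_exists('grapheme_strlen') ? grapheme_strlen($s) : null,
    ];
}

// U+1F1EC U+1F1E7: the two regional indicator symbols for "GB".
$flag = "\u{1F1EC}\u{1F1E7}";

foreach (string_lengths($flag) as $unit => $length) {
    echo $unit, ': ', $length === null ? '(extension not loaded)' : $length, "\n";
}
```

With both extensions loaded, the three measures should differ: 8 bytes,
2 code points, and 1 grapheme cluster.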



[1] 
http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php


Regards,

--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [Discussion] Scalar Object Strings and MultibyteEncodings

2019-06-20 Thread Mark Randall

On 20/06/2019 22:19, Rowan Collins wrote:

> On 20/06/2019 16:36, Mark Randall wrote:
>> "Hello".substr(1) // would work as expected regardless of encoding
>
> As I always point out when "multi-byte support" or "Unicode support" is
> discussed, it's often ambiguous just what should be "expected".
>
> My point is that any attempt to make the language "do the right thing by
> default" needs serious thought on what "the right thing" is.


Without a doubt, and I expect people will have terrible flashbacks to 
the PHP 6 discussions when thinking about it. It will require a 
consensus that I have no power to aid or influence.


There does at least seem to be the starting point in that mbstring is 
already widely used, and my suggestion that it "work as expected" is 
more that it would work as the equivalent mbstring / iconv function would.


mb_strlen returns the number of code points, for example. I'm not 
immediately seeing anything about mbstring supporting graphemes, as the 
only reference I could find to their manipulation was the intl extension.
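
To make the difference concrete for the substr(1) case quoted earlier, 
here is a minimal sketch comparing the byte-based, mbstring, and intl 
versions of "everything after the first character". The helper name 
tail_from_one() and the sample string are illustrative assumptions, the 
input is assumed to be UTF-8, and a result is null when the relevant 
extension is not loaded.

```php
<?php
/**
 * Return the string from offset 1 onward, measured three ways
 * (hypothetical helper). A result is null when the extension
 * providing that function is not loaded.
 */
function tail_from_one(string $s): array
{
    return [
        // Offset counted in bytes.
        'bytes'       => substr($s, 1),
        // Offset counted in code points (mbstring).
        'code_points' => function_exists('mb_substr') ? mb_substr($s, 1, null, 'UTF-8') : null,
        // Offset counted in grapheme clusters (intl).
        'graphemes'   => function_exists('grapheme_substr') ? grapheme_substr($s, 1) : null,
    ];
}

// U+00E9 (e with acute) + U+0323 (combining dot below) + "x".
$s = "\u{00E9}\u{0323}x";

// Print as hex, because the byte-based result is not valid UTF-8.
foreach (tail_from_one($s) as $unit => $tail) {
    echo $unit, ': ', $tail === null ? '(extension not loaded)' : bin2hex($tail), "\n";
}
```

The byte-based version cuts the first code point in half, the mbstring 
version leaves an orphaned combining mark at the start, and only the 
intl version returns just "x".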


--
Mark Randall
