Re: [PHP-DEV] Multibyte strings
Good morning,

On Sat, Feb 12, 2022, 3:47 AM Rowan Tommins wrote:

> On 11/02/2022 18:42, Michał wrote:
> > Considering the given example, the description from the documentation
> > of strlen function: "Returns the length of the given string".
>
> Which is exactly what it does. Using Unicode terminology [see
> https://unicode.org/glossary], here are a few different things you could
> count to determine the "length" of a string:
>
> a) bits
> b) bytes
> c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of 8 bits)
> d) code points (one of 1,112,064 numbers that can be given a meaning by the Unicode standard)
> e) graphemes (what a user would generally think of as a "character")
> f) pixels (or any other unit of physical size)

That is why we have intl, which uses ICU and allows users to update it. That means using the latest standard if needed.

Best,
Pierre
Re: [PHP-DEV] Multibyte strings
On 11/02/2022 18:42, Michał wrote:

> Considering the given example, the description from the documentation
> of strlen function: "Returns the length of the given string".

Which is exactly what it does. Using Unicode terminology [see https://unicode.org/glossary], here are a few different things you could count to determine the "length" of a string:

a) bits
b) bytes
c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of 8 bits)
d) code points (one of 1,112,064 numbers that can be given a meaning by the Unicode standard)
e) graphemes (what a user would generally think of as a "character")
f) pixels (or any other unit of physical size)

mb_strlen() will measure (d), which is frankly pretty useless - do you really need to know that "noél" spelled with a combining diacritic ("e" followed by U+0301) is 5 code points long, but "noél" spelled with the pre-composed accented letter (U+00E9) is only 4? Much more often you want strlen() to tell you (b) - one will take up 6 bytes of storage and the other only 5; or grapheme_strlen() to tell you (e) - both have 4 graphemes.

The same goes for the "mb_strcut" function mentioned by Mel Dafert; try running this on the spelling with the combining accent:

echo mb_strcut("noe\u{301}l", 3, 3, 'UTF-8');

https://3v4l.org/s2SsR

The algorithm "correctly" keeps all the bytes of the acute accent, but drops the "e" it was on top of; probably not a very useful result.

And that's before we get to functions which should behave differently in different languages, like correctly capitalising "i" in Turkish: https://en.wikipedia.org/wiki/Dotted_and_dotless_I

Doing this stuff right is really, really difficult; and that is the reason it doesn't just "work out of the box".

Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
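The three counts discussed in this message can be compared side by side. The following is only a minimal sketch: `string_lengths()` is a hypothetical helper that does not come from the thread, and the PCRE fallbacks are an assumption added so that the script still runs when mbstring or intl is not installed.

```php
<?php
// Hypothetical helper (not from the thread): counts one string three ways.
function string_lengths(string $s): array
{
    return [
        // (b) bytes: what strlen() has always returned
        'bytes' => strlen($s),
        // (d) code points: mb_strlen(), or a PCRE fallback when mbstring is missing
        'code_points' => function_exists('mb_strlen')
            ? mb_strlen($s, 'UTF-8')
            : preg_match_all('/./su', $s),
        // (e) graphemes: grapheme_strlen() from intl, or PCRE's \X as a fallback
        'graphemes' => function_exists('grapheme_strlen')
            ? grapheme_strlen($s)
            : preg_match_all('/\X/u', $s),
    ];
}

$decomposed  = "noe\u{301}l"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "no\u{e9}l";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE

print_r(string_lengths($decomposed));  // bytes 6, code_points 5, graphemes 4
print_r(string_lengths($precomposed)); // bytes 5, code_points 4, graphemes 4
```

Writing the strings with `\u{...}` escapes keeps the two spellings distinct even if an editor or archive normalises the source text.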
Re: [PHP-DEV] Multibyte strings
On 11 February 2022 07:26:45 CET, "Michał" wrote:

> Hi everyone.
> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.

As others have said, any change to behaviour in something as subtle as string encoding makes little sense (see PHP 6 or the mess that was the migration from Python 2 to 3, which did exactly that).

However, I do see an argument to be made to make the mbstring extension always available, similar to what was done with the json extension [1].

Currently, one cannot assume to have access to things like mb_strcut, which makes writing code that does not break when it's fed UTF-8 relatively complicated. Frameworks like Drupal also require mbstring for anything other than English content [2].

The manual [3] also says that it does not require any external libraries, so there does not seem to be any technical obstacle either.

Would that be an option? Or am I missing some obvious reason that mbstring should not be always available, like licensing issues?

Regards,
Mel

[1] https://wiki.php.net/rfc/always_enable_json
[2] https://www.drupal.org/docs/system-requirements/php-requirements#s-mbstring-
[3] https://www.php.net/manual/en/mbstring.requirements.php
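To illustrate what "cannot assume to have access to mb_strcut" means for portable code, here is a rough sketch of the kind of guard a library might carry today. The helper name `utf8_cut()` is hypothetical, the fallback assumes the input is already valid UTF-8, and it only covers cutting from the start of the string.

```php
<?php
// Hypothetical helper: keep at most $maxBytes bytes without splitting a UTF-8 character.
function utf8_cut(string $s, int $maxBytes): string
{
    if (function_exists('mb_strcut')) {
        // mbstring is available: let it find the character boundary.
        return mb_strcut($s, 0, $maxBytes, 'UTF-8');
    }

    // Fallback without mbstring: cut on a byte boundary, then drop trailing
    // bytes until the result is valid UTF-8 again (preg_match with /u fails
    // on malformed UTF-8, so it serves as a validity check here).
    $cut = substr($s, 0, $maxBytes);
    while ($cut !== '' && !preg_match('//u', $cut)) {
        $cut = substr($cut, 0, -1);
    }
    return $cut;
}

$word = "no\u{e9}l"; // 5 bytes: "é" occupies bytes 2 and 3

var_dump(utf8_cut($word, 3)); // "no": a 3-byte cut would split "é", so it is dropped
var_dump(utf8_cut($word, 4)); // "noé"
```

If mbstring were always available, as suggested above, the fallback branch would become unnecessary.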
Re: [PHP-DEV] Multibyte strings
> This++. Unicode is not a static standard definition of all characters.
> New emoji are being added to the specification regularly, and while a
> glyph like 👪 might look like a single "character" to a set of human
> eyes, and indeed in Unicode 6.0 is a single code point (U+1F46A), prior
> to Unicode 6.0 (and still FTR) it was still expressible using Zero Width
> Joining as five separate code points: [MAN][ZWJ][WOMAN][ZWJ][BOY], which
> mb_strlen() will tell you is five "characters" long, despite being
> visible as a single grapheme. Okay, so we look at the ICU grapheme
> functions, but depending on what version of the Unicode database is
> installed, that answer may be five or one.
>
> In short: Language is complicated and there's not a one-size-fits-all
> solution.
>
> -Sara

Thank you, Sara, for a great example. I didn't know that the topic was covered in PHP 6.
Re: [PHP-DEV] Multibyte strings
On 11.02.2022 at 16:41, Kirill Nesmeyanov wrote:

```
$string = 'Hell or world!';

echo 'Bytes: ' . \strlen($string) . "\n";
echo 'Chars: ' . \mb_strlen($string);
```

Thanks, Kirill, for your answer. I totally agree that stream and text functions are two different things. However, in the context of cleaning up the PHP language, the inconsistency is very disturbing.

Consider the given example and the documentation of the strlen function: "Returns the length of the given string". Only below that can you find the note that the function "returns the number of bytes". So strlen lives in the virtual String namespace (String functions) and its description says that it returns the length of the string, but if you pass a multibyte string it returns the number of bytes, not the number of characters. In that case there should be a bytes_length function, or something like Stream::fromString(string $string)->getSize(); (StreamInterface from PSR-7 is also a great example).

So, using the example given, a natural and logical approach would be:

```
$string = 'Hell or world!';

echo 'Bytes: ' . \bytes_length($string) . "\n";
echo 'Chars: ' . \strlen($string); // in that case an alias for mb_strlen
```
Re: [PHP-DEV] Multibyte strings
On Fri, Feb 11, 2022 at 3:14 AM Rowan Tommins wrote:

> There's also I think a myth in people's minds that something like
> "string length" has a single meaning, and PHP gets it "wrong" for
> multibyte strings;

This++. Unicode is not a static standard definition of all characters. New emoji are being added to the specification regularly, and while a glyph like 👪 might look like a single "character" to a set of human eyes, and indeed in Unicode 6.0 is a single code point (U+1F46A), prior to Unicode 6.0 (and still FTR) it was still expressible using Zero Width Joining as five separate code points: [MAN][ZWJ][WOMAN][ZWJ][BOY], which mb_strlen() will tell you is five "characters" long, despite being visible as a single grapheme. Okay, so we look at the ICU grapheme functions, but depending on what version of the Unicode database is installed, that answer may be five or one.

In short: Language is complicated and there's not a one-size-fits-all solution.

-Sara
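The two spellings of the family emoji can be inspected directly. This is a small sketch rather than anything from the thread: it splits the strings into code points with PCRE, which is an assumption made so that it runs without mbstring, and it prints the grapheme count only when intl is present, because that number depends on the installed ICU version.

```php
<?php
// MAN, ZWJ, WOMAN, ZWJ, BOY: the five-code-point spelling of the family emoji.
$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";
// U+1F46A FAMILY: the single-code-point spelling.
$single = "\u{1F46A}";

// Split into code points with PCRE; for valid UTF-8, mb_strlen() counts the same units.
$familyCodePoints = preg_split('//u', $family, -1, PREG_SPLIT_NO_EMPTY);
$singleCodePoints = preg_split('//u', $single, -1, PREG_SPLIT_NO_EMPTY);

printf("ZWJ sequence: %d bytes, %d code points\n", strlen($family), count($familyCodePoints));
printf("U+1F46A:      %d bytes, %d code points\n", strlen($single), count($singleCodePoints));

// The grapheme count comes from ICU, so it is printed rather than assumed.
if (function_exists('grapheme_strlen')) {
    printf(
        "Graphemes in the ZWJ sequence: %d (ICU %s)\n",
        grapheme_strlen($family),
        INTL_ICU_VERSION
    );
}
```

The byte and code point counts are fixed by UTF-8 itself (18 bytes and 5 code points for the sequence, 4 bytes and 1 code point for U+1F46A); only the grapheme count can vary between installations.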
Re: [PHP-DEV] Multibyte strings
On Fri, Feb 11, 2022 at 12:26 AM Michał wrote:

> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.

Only that it would break a great number of assumptions if strlen("é"), after decades of returning 2, suddenly returned 1. That's a trite example, but it's the sort of deep rabbit hole that emerges when you start to really examine the problem in depth.

Perhaps you're unfamiliar with the work that went into PHP 6. It turns out that building Unicode into the heart of PHP isn't a new idea that you've just had; it's something which we invested a great deal of effort into, and the discovery we made along the way is that it's a great deal of complication and computational overhead for dubious benefit.

Turns out that yes, developers do use UTF-8 almost exclusively, and they know exactly when to use multi-byte aware functions and when octet-focused functions make more sense. The landscape is covered in abstractions to make this simple and automatic, and suddenly changing the foundation would do more harm than good, both in terms of developer productivity and performance.

-Sara
Re: [PHP-DEV] Multibyte strings
> Friday, 11 February 2022, 9:27 +03:00 from Michał:
>
> Hi everyone.
> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.

Hello, Michal!

The functions for getting the length in bytes and the functions for getting the length of a string in characters are different functions for different tasks. That is, `mb_strlen` is not equivalent to `strlen` and cannot replace it:

```
$string = 'Hell or world!';

echo 'Bytes: ' . \strlen($string) . "\n";
echo 'Chars: ' . \mb_strlen($string);
```

When you work with data (sockets, row sizes in the database, shared memory, and so on), you operate with bytes. The size in characters is rarely required; one example is formatting output in the console (with UTF support).

So, answering your question about "when", the answer is simple: this will never be done, because these are functions for different tasks ;)

--
Kirill Nesmeyanov
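As an illustration of the "you operate with bytes" case, here is a minimal sketch of length-prefixed framing, a common way to send messages over a socket. The wire format and the `frame()` / `unframe()` names are hypothetical and chosen only for this example; the point is that the length written to the wire has to be the byte count from strlen().

```php
<?php
// Hypothetical wire format: a 4-byte big-endian length followed by the payload bytes.
function frame(string $payload): string
{
    // strlen() gives the byte count, which is what the receiver must read.
    return pack('N', strlen($payload)) . $payload;
}

function unframe(string $frame): string
{
    // Read the 4-byte length prefix, then take exactly that many bytes.
    $length = unpack('N', substr($frame, 0, 4))[1];
    return substr($frame, 4, $length);
}

$message = "no\u{e9}l"; // 4 characters, but 5 bytes in UTF-8
$frame   = frame($message);

var_dump(strlen($frame));               // int(9): 4-byte prefix + 5 payload bytes
var_dump(unframe($frame) === $message); // bool(true)
```

Had the prefix been computed from a character count, the receiver would read 4 bytes and cut the final "l" off the message.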
Re: [PHP-DEV] Multibyte strings
On 11/02/2022 06:26, Michał wrote:

> Hi everyone.
> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.

Hi Michal,

If only it were as simple as that... You might want to read up on the history of PHP 6.0, the version which never happened, because the project to introduce native Unicode strings turned out to be so complex, and to introduce so many performance problems.

There is a hint at part of the complexity in your phrasing "at least UTF-8 encoding" - there isn't really anything that's "more than" UTF-8, but there are certainly other common encodings - Windows-1252 mislabelled as ISO 8859-1 is a common one; UTF-16 has historically been common on Windows, and is a more efficient encoding in some contexts. So having PHP simply assume that all data is in UTF-8 won't work; you will always need to be able to represent a string of bytes and tell PHP to interpret it as some encoding.

There are also many contexts (e.g. processing binary files) where interpreting strings as a sequence of bytes (as PHP does now) is absolutely correct. PHP 6.0 would have handled this similarly to Python 3, with "binary strings" and "Unicode strings" as two separate types.

There's also, I think, a myth in people's minds that something like "string length" has a single meaning, and PHP gets it "wrong" for multibyte strings; but actually the value given by functions like mb_strlen (the number of Unicode code points) is pretty useless - generally, people are actually interested in how many bytes the string will take up (as returned by PHP's strlen) or how much space it will take up on screen (a really difficult question, but grapheme_strlen, which counts what you'd think of as "letters", is a better bet than counting code points, which can be individual accents).

There probably *are* things PHP could do to improve Unicode handling, but it needs careful thought to avoid making everything worse.

Regards,

--
Rowan Tommins
[IMSoP]
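The point that a string is a sequence of bytes which PHP has to be told how to interpret can be made concrete. The sketch below is not from the thread: it assumes the mbstring extension is loaded (and simply stops if it is not), and it uses `preg_match()` with the `/u` modifier as a UTF-8 validity check.

```php
<?php
if (!extension_loaded('mbstring')) {
    exit("This sketch needs the mbstring extension.\n");
}

// The four bytes that spell "café" in Windows-1252 / ISO 8859-1.
$legacy = "caf\xE9";

// Read as UTF-8, the lone 0xE9 byte is malformed; preg_match() with /u fails on invalid UTF-8.
$validAsUtf8 = preg_match('//u', $legacy) === 1;

// Tell PHP which encoding the bytes are in and convert them explicitly.
$utf8 = mb_convert_encoding($legacy, 'UTF-8', 'Windows-1252');

// The same text costs a different number of bytes in different encodings:
// three CJK characters from the Basic Multilingual Plane.
$japanese = "\u{65E5}\u{672C}\u{8A9E}";
$asUtf16  = mb_convert_encoding($japanese, 'UTF-16LE', 'UTF-8');

var_dump($validAsUtf8); // bool(false)
printf("Windows-1252: %d bytes, UTF-8: %d bytes\n", strlen($legacy), strlen($utf8));  // 4 and 5
printf("UTF-8: %d bytes, UTF-16LE: %d bytes\n", strlen($japanese), strlen($asUtf16)); // 9 and 6
```

In every case strlen() reports the storage size of the bytes it was given; it is the encoding label, supplied by the programmer, that decides what those bytes mean.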
[PHP-DEV] Multibyte strings
Hi everyone.

It's a known fact that nowadays most websites use at least UTF-8 encoding. Unfortunately PHP itself has stopped a bit in the previous century. Is there any reason why the mbstring extension cannot be introduced to core in the next major version (maybe preceded with a deprecation message like it was with the mysql extension in v5)? All functions from the standard library would become aliases for multibyte equivalents.