Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Pierre Joye
Good morning,

On Sat, Feb 12, 2022, 3:47 AM Rowan Tommins wrote:

> On 11/02/2022 18:42, Michał wrote:
> > Considering the given example, the description from the documentation
> > of strlen function: "Returns the length of the given string".
>
>
> Which is exactly what it does. Using Unicode terminology [see
> https://unicode.org/glossary], here are a few different things you could
> count to determine the "length" of a string:
>
> a) bits
> b) bytes
> c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of
> 8 bits)
> d) code points (one of 1,112,064 numbers that can be given a meaning by
> the Unicode standard)
> e) graphemes (what a user would generally think of as a "character")
> f) pixels (or any other unit of physical size)
>

That is why we have intl, which uses ICU and allows users to update it.
That means they can use the latest standard if needed.

best,
Pierre


Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Rowan Tommins

On 11/02/2022 18:42, Michał wrote:
> Considering the given example, the description from the documentation
> of strlen function: "Returns the length of the given string".



Which is exactly what it does. Using Unicode terminology [see 
https://unicode.org/glossary], here are a few different things you could 
count to determine the "length" of a string:


a) bits
b) bytes
c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of 
8 bits)
d) code points (one of 1,112,064 numbers that can be given a meaning by 
the Unicode standard)

e) graphemes (what a user would generally think of as a "character")
f) pixels (or any other unit of physical size)

mb_strlen() will measure (d), which is frankly pretty useless - do you 
really need to know that "noél" is 5 code points long when it is spelled 
with a combining diacritic (e + U+0301), but only 4 when it is spelled 
with the pre-composed accented letter (U+00E9)?


Much more often you want strlen() to tell you (b) - one will take up 6 
bytes of storage and the other only 5; or grapheme_strlen() to tell you 
(e) - both have 4 graphemes.
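
A minimal sketch of the three counts, assuming the mbstring and intl 
extensions are loaded. The \u{...} escapes are used here only so that the 
two spellings can be told apart on the page:

```php
<?php
// Two spellings of the same word.
$decomposed  = "noe\u{0301}l"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "no\u{00E9}l";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

foreach (['decomposed' => $decomposed, 'precomposed' => $precomposed] as $label => $s) {
    printf(
        "%-11s bytes=%d code points=%d graphemes=%d\n",
        $label,
        strlen($s),             // (b) bytes
        mb_strlen($s, 'UTF-8'), // (d) code points
        grapheme_strlen($s)     // (e) graphemes, needs ext/intl
    );
}
// decomposed  bytes=6 code points=5 graphemes=4
// precomposed bytes=5 code points=4 graphemes=4
```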



The same goes for the "mb_strcut" function mentioned by Mel Dafert; try 
running this:


echo mb_strcut('noél', 3, 3, 'UTF-8');

https://3v4l.org/s2SsR

The algorithm "correctly" keeps all the bytes of the acute accent, but 
drops the "e" it was on top of; probably not a very useful result.
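
To make the dropped "e" visible, the result can be dumped as hex. This is 
only a sketch and assumes the word is spelled with the combining accent 
(e + U+0301), which is written with an escape here so the spelling is 
unambiguous:

```php
<?php
// Byte layout of the input: n(0) o(1) e(2) CC(3) 81(4) l(5)
$word = "noe\u{0301}l";

// Start at byte 3 (the first byte of the accent) and keep 3 bytes.
$cut = mb_strcut($word, 3, 3, 'UTF-8');

echo bin2hex($cut), "\n"; // cc816c: the accent (cc 81) and "l" (6c), but no "e"
```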



And that's before we get to functions which should behave differently in 
different languages, like correctly capitalising "i" in Turkish: 
https://en.wikipedia.org/wiki/Dotted_and_dotless_I
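
As a rough illustration, mbstring's case mapping takes no locale, so it 
cannot produce the Turkish result. The second half of this sketch assumes 
ext/intl is loaded and that the bundled ICU offers a "tr-Upper" transform; 
if it does not, that part is skipped:

```php
<?php
// mbstring applies the default Unicode mapping, with no notion of locale.
$upper = mb_strtoupper('i', 'UTF-8');
echo $upper, "\n"; // I (Turkish expects U+0130, the dotted capital)

// Assumption: ICU provides a locale-specific "tr-Upper" transform.
$turkish = class_exists('Transliterator') ? Transliterator::create('tr-Upper') : null;
if ($turkish !== null) {
    echo $turkish->transliterate('i'), "\n";
}
```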


Doing this stuff right is really, really difficult; and that is the 
reason it doesn't just "work out of the box".



Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Mel Dafert
On 11 February 2022 07:26:45 CET, "Michał" wrote:
>Hi everyone.
>It's a known fact that nowadays most websites use at least UTF-8 
>encoding. Unfortunately PHP itself has stopped a bit in the previous 
>century. Is there any reason why the mbstring extension cannot be 
>introduced to core in the next major version (maybe preceded with a 
>deprecation message like it was with the mysql extension in v5)? All 
>functions from the standard library would become aliases for multibyte 
>equivalents.

As others have said, any change to behaviour in something as subtle as
string encoding makes little sense (see PHP 6 or the mess that was the migration
from Python 2 to 3, which did exactly that).

However, I do see an argument to be made to make the mbstring extension
always available, similar to what was done with the json extension [1].
Currently, one cannot assume that functions like mb_strcut are available,
which makes it relatively complicated to write code that does not break
when it's fed UTF-8.
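
To illustrate the kind of workaround this forces on library authors, here 
is a sketch of a helper that uses mb_strcut() when it exists and falls back 
to scanning bytes otherwise. The name utf8_cut() is hypothetical, and the 
fallback assumes valid UTF-8 input and a non-negative $maxBytes:

```php
<?php
/**
 * Hypothetical helper: keep at most $maxBytes bytes from the start of a
 * UTF-8 string without splitting a multi-byte sequence.
 */
function utf8_cut(string $s, int $maxBytes): string
{
    if (function_exists('mb_strcut')) {
        return mb_strcut($s, 0, $maxBytes, 'UTF-8');
    }
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    // Fallback: step back while the first dropped byte is a
    // continuation byte (bit pattern 10xxxxxx).
    $end = $maxBytes;
    while ($end > 0 && (ord($s[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($s, 0, $end);
}

// "noél" with a pre-composed é is 6e 6f c3 a9 6c; the two-byte é does not fit in 3 bytes.
echo bin2hex(utf8_cut("no\u{00E9}l", 3)), "\n"; // 6e6f
```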

Frameworks like Drupal also require mbstring for anything other than English
content [2].

The manual [3] also says that it does not require any external libraries, so
there does not seem to be any technical obstacle either.

Would that be an option? Or am I missing some obvious reason that mbstring
should not be always available, like licensing issues?

Regards,
Mel

[1] https://wiki.php.net/rfc/always_enable_json
[2] https://www.drupal.org/docs/system-requirements/php-requirements#s-mbstring-
[3] https://www.php.net/manual/en/mbstring.requirements.php




Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Michał


> This++.
>
> Unicode is not a static standard definition of all characters.  New emoji
> are being added to the specification daily and while a glyph like 👪 might
> look like a single "character" to a set of human eyes, and indeed in
> Unicode 6.0 is a single codepoint (U+1F46A), prior to Unicode 6.0 (and
> still FTR) it was still expressible using Zero Width Joining as five
> separate code points: [MAN][ZWJ][WOMAN][ZWJ][BOY] which mb_strlen() will
> tell you is five "characters" long, despite being visible as a single
> grapheme.  Okay, so we look at the ICU grapheme functions, but depending on
> what version of the Unicode database is installed, that answer may be five
> or one.
>
> In short: Language is complicated and there's not a one-size-fits-all
> solution.
>
> -Sara



Thank you, Sara, for a great example. I didn't know that the topic was 
covered in PHP 6.





Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Michał

On 11.02.2022 at 16:41, Kirill Nesmeyanov wrote:

```
$string = 'Hell  or  world!';

echo 'Bytes: ' . \strlen($string) . "\n";
echo 'Chars: ' . \mb_strlen($string);
```


Thanks, Kirill, for your answer.
I totally agree that stream and text functions are two different things. 
However, in the context of cleaning up the PHP language, the 
inconsistency is very disturbing. Consider the given example: the 
documentation describes strlen as "Returns the length of the given 
string", and only further down is there a note that the function 
"returns the number of bytes". So strlen sits in the virtual namespace 
String (String functions) and its description says that it returns the 
length of the string, yet for a multibyte string it returns the number 
of bytes, not the number of characters. For that purpose there should be 
a bytes_length function, or something like 
Stream::fromString(string $string)->getSize(); (StreamInterface from 
PSR-7 is also a great example). So, using the example given, a natural 
and logical approach would be:


```
$string = 'Hell  or  world!';

echo 'Bytes: ' . \bytes_length($string) . "\n"; // proposed function, does not exist in PHP
echo 'Chars: ' . \strlen($string); // in that case alias for mb_strlen
```




Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Sara Golemon
On Fri, Feb 11, 2022 at 3:14 AM Rowan Tommins wrote:

> There's also I think a myth in people's minds that something like
> "string length" has a single meaning, and PHP gets it "wrong" for
> multibyte strings;
>

This++.

Unicode is not a static standard definition of all characters.  New emoji
are being added to the specification daily and while a glyph like 👪 might
look like a single "character" to a set of human eyes, and indeed in
Unicode 6.0 is a single codepoint (U+1F46A), prior to Unicode 6.0 (and
still FTR) it was still expressible using Zero Width Joining as five
separate code points: [MAN][ZWJ][WOMAN][ZWJ][BOY] which mb_strlen() will
tell you is five "characters" long, despite being visible as a single
grapheme.  Okay, so we look at the ICU grapheme functions, but depending on
what version of the Unicode database is installed, that answer may be five
or one.
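
A small sketch of the joined sequence described above, assuming mbstring 
is loaded. The grapheme count is printed without an expected value, 
because it depends on the ICU version that ext/intl was built against:

```php
<?php
// MAN, ZWJ, WOMAN, ZWJ, BOY: five code points that capable fonts
// render as a single family glyph.
$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}";

echo strlen($family), "\n";             // 18 bytes (4 + 3 + 4 + 3 + 4)
echo mb_strlen($family, 'UTF-8'), "\n"; // 5 code points

// Version-dependent, so no expected value is given here.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($family), "\n";
}
```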

In short: Language is complicated and there's not a one-size-fits-all
solution.

-Sara


Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Sara Golemon
On Fri, Feb 11, 2022 at 12:26 AM Michał wrote:

> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.
>
>
Only that it would break a great number of assumptions if strlen("é") after
decades of returning 2 suddenly returned 1.  That's a trite example, but
it's the sort of deep rabbit hole that emerges when you start to really
examine the problem in depth.

Perhaps you're unfamiliar with the work that went into PHP 6. It turns out
that building Unicode into the heart of PHP isn't a new idea that you've
just had; it's something which we invested a great deal of effort into, and
the discovery we made along the way is that it brings a great deal of
complication and computational overhead for dubious benefit.  It turns out
that yes, developers do use UTF-8 almost exclusively, and they know exactly
when to use multibyte-aware functions and when octet-focused functions
make more sense.  The landscape is covered in abstractions to make this
simple and automatic, and suddenly changing the foundation would do more
harm than good both in terms of developer productivity and performance.
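
One illustrative case of such an assumption (a sketch, not taken from any 
particular codebase) is byte-oriented code that builds a length header 
with strlen():

```php
<?php
$body = "caf\u{00E9}"; // 4 characters, 5 bytes in UTF-8

// Byte-oriented code written against today's strlen():
$header = 'Content-Length: ' . strlen($body);
echo $header, "\n"; // Content-Length: 5

// If strlen() became an alias of mb_strlen(), the same line would
// announce 4, one byte short of what is actually sent.
$aliased = 'Content-Length: ' . mb_strlen($body, 'UTF-8');
echo $aliased, "\n"; // Content-Length: 4
```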

-Sara


Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Kirill Nesmeyanov

>Friday, 11 February 2022, 9:27 +03:00 from Michał:
> 
>Hi everyone.
>It's a known fact that nowadays most websites use at least UTF-8
>encoding. Unfortunately PHP itself has stopped a bit in the previous
>century. Is there any reason why the mbstring extension cannot be
>introduced to core in the next major version (maybe preceded with a
>deprecation message like it was with the mysql extension in v5)? All
>functions from the standard library would become aliases for multibyte
>equivalents.
>


Hello, Michal!

The functions for getting the length in bytes and the functions for getting the 
length of a string in characters are different functions for different tasks.

That is, `mb_strlen` is not equivalent to `strlen` and cannot replace it:

```
$string = 'Hell  or  world!';

echo 'Bytes: ' . \strlen($string) . "\n";
echo 'Chars: ' . \mb_strlen($string);
```

When you work with data (sockets, row sizes in the database, shared memory, and 
so on), you operate on bytes. The size in characters is needed only rarely, 
for example to format console output (with UTF-8 support).
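
A sketch of the console case, assuming mbstring is loaded. The helper 
pad_chars() is hypothetical, and code points are themselves only an 
approximation of on-screen width:

```php
<?php
// Hypothetical helper: pad to a width measured in code points, not bytes.
function pad_chars(string $s, int $width): string
{
    return $s . str_repeat(' ', max(0, $width - mb_strlen($s, 'UTF-8')));
}

$name = "Zo\u{00EB}"; // 3 characters, 4 bytes in UTF-8

echo '[' . str_pad($name, 5) . "]\n";   // [Zoë ]  str_pad counts bytes: one space added
echo '[' . pad_chars($name, 5) . "]\n"; // [Zoë  ] two spaces, five visible columns
```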

So answering your question about "when" - the answer is simple: This will never 
be done, because these are functions for different tasks ;)
 
 
--
Kirill Nesmeyanov
 

Re: [PHP-DEV] Multibyte strings

2022-02-11 Thread Rowan Tommins

On 11/02/2022 06:26, Michał wrote:

> Hi everyone.
> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.




Hi Michal,

If only it were as simple as that...

You might want to read up on the history of PHP 6.0, the version which 
never happened, because the project to introduce native Unicode strings 
turned out to be so complex and to introduce so many performance problems.


There is a hint at part of the complexity in your phrasing "at least 
UTF-8 encoding" - there isn't really anything that's "more than" UTF-8, 
but there are certainly other common encodings - Windows-1252 
mislabelled as ISO 8859-1 is a common one; UTF-16 has historically been 
common on Windows, and is a more efficient encoding in some contexts. So 
having PHP simply assume that all data is in UTF-8 won't work; you will 
always need to be able to represent a string of bytes and tell PHP to 
interpret it as some encoding. There are also many contexts (e.g. 
processing binary files) where interpreting strings as a sequence of 
bytes (as PHP does now) is absolutely correct. PHP 6.0 would have 
handled this similarly to Python 3, with "binary strings" and "Unicode 
strings" as two separate types.
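
A short sketch of what telling PHP the encoding looks like in practice, 
assuming mbstring is loaded. The sample bytes are made up for the example:

```php
<?php
$bytes = "caf\xE9"; // "café" as Windows-1252 bytes; not valid UTF-8

var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false)

// PHP cannot know the source encoding; the caller has to state it.
$utf8 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
echo bin2hex($utf8), "\n";                   // 636166c3a9
```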


There's also, I think, a myth in people's minds that something like 
"string length" has a single meaning, and PHP gets it "wrong" for 
multibyte strings; but actually the value given by functions like 
mb_strlen (the number of Unicode code points) is pretty useless - 
generally, people are actually interested in how many bytes the string 
will take up (as returned by PHP strlen) or how much space it will take 
up on screen (a really difficult question, but grapheme_strlen, which 
counts what you'd think of as "letters", is a better bet than counting 
code points, which can be individual accents).


There probably *are* things PHP could do to improve Unicode handling, 
but it needs careful thought to avoid making everything worse.


Regards,

--
Rowan Tommins
[IMSoP]




[PHP-DEV] Multibyte strings

2022-02-10 Thread Michał

Hi everyone.
It's a known fact that nowadays most websites use at least UTF-8 
encoding. Unfortunately PHP itself has stopped a bit in the previous 
century. Is there any reason why the mbstring extension cannot be 
introduced to core in the next major version (maybe preceded with a 
deprecation message like it was with the mysql extension in v5)? All 
functions from the standard library would become aliases for multibyte 
equivalents.

