On 22/03/2021 15:04, Aleksander Machniak wrote:

I'm using utf8_encode()/utf8_decode() to make input string safe to be
stored in DB, and back. In most cases the input is utf-8, but it
occasionally may contain "broken characters".


That is not what this function does, at all. The fact that its name makes you think that is exactly why I want to get rid of that name.


     $str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";

     $this->assertSame($str, utf8_decode(utf8_encode($str)));


Let's write that out with more descriptive function names:

$str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";

$this->assertSame($str, utf8_to_latin1(latin1_to_utf8($str)));


Since Latin-1 does not contain any Chinese, Japanese, or Emoji characters, running latin1_to_utf8 on that string is clearly nonsensical.

The only reason it doesn't give you any errors is that every possible byte is a valid character in Latin-1, and every Latin-1 character has a Unicode code point. So the "グ", which is the three bytes E3 82 B0 in UTF-8, is interpreted as three Latin-1 characters: E3, 82, and B0; those then become the corresponding Unicode code points U+00E3, U+0082, and U+00B0, represented in UTF-8. You then run utf8_to_latin1, and they get converted back.
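
To make the byte-level behaviour concrete, here is a rough sketch. The latin1_to_utf8() and utf8_to_latin1() functions below are hypothetical pure-PHP stand-ins that I wrote for illustration, not the real implementations; in particular, the '?' fallback is a simplification of what utf8_decode() does with input that doesn't fit in Latin-1.

```php
<?php
// Hypothetical stand-in for utf8_encode(): treat every input byte as one
// Latin-1 character and emit the UTF-8 encoding of that code point.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $c = ord($bytes[$i]);
        $out .= $c < 0x80
            ? chr($c)                                           // ASCII stays one byte
            : chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F)); // U+0080..U+00FF take two bytes
    }
    return $out;
}

// Hypothetical stand-in for utf8_decode(): map U+0000..U+00FF back to single
// bytes. Simplification: any other byte becomes '?'.
function utf8_to_latin1(string $utf8): string
{
    $out = '';
    $n = strlen($utf8);
    for ($i = 0; $i < $n; $i++) {
        $c = ord($utf8[$i]);
        if ($c < 0x80) {
            $out .= chr($c);
        } elseif (($c === 0xC2 || $c === 0xC3)
            && $i + 1 < $n
            && (ord($utf8[$i + 1]) & 0xC0) === 0x80
        ) {
            $out .= chr((($c & 0x03) << 6) | (ord($utf8[$i + 1]) & 0x3F));
            $i++; // consume the continuation byte
        } else {
            $out .= '?';
        }
    }
    return $out;
}

$str = "\xE3\x82\xB0";            // the UTF-8 bytes of "グ"
$encoded = latin1_to_utf8($str);

echo bin2hex($str), "\n";                     // e382b0
echo bin2hex($encoded), "\n";                 // c3a3c282c2b0 (U+00E3, U+0082, U+00B0)
echo bin2hex(utf8_to_latin1($encoded)), "\n"; // e382b0

// A byte that is not valid UTF-8 on its own survives the round trip unchanged.
echo bin2hex(utf8_to_latin1(latin1_to_utf8("\xFF"))), "\n"; // ff
```

The round trip hands back exactly the bytes it was given, valid UTF-8 or not, so it neither detects nor repairs anything.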

That code will never do anything useful.

Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
