On 22/03/2021 15:04, Aleksander Machniak wrote:
> I'm using utf8_encode()/utf8_decode() to make input string safe to be
> stored in DB, and back. In most cases the input is utf-8, but it
> occasionally may contain "broken characters".
That is not what this function does, at all. The fact that its name
makes you think that is exactly why I want to get rid of that name.
> $str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";
> $this->assertSame($str, utf8_decode(utf8_encode($str)));
Let's write that out with a more descriptive function name:
$str = "グーグル谷歌中信фδοκιμήóźdźрöß😁😃";
$this->assertSame($str, utf8_to_latin1(latin1_to_utf8($str)));
Since Latin-1 does not contain any Chinese, Japanese, or Emoji
characters, running latin1_to_utf8 on that string is clearly nonsensical.
The only reason it doesn't give you any errors is that every possible
byte is a valid character in Latin-1, and every Latin-1 character has a
Unicode code point. So the "グ" is interpreted as three Latin-1
characters: E3, 82, and B0; those then become the corresponding Unicode
code points U+00E3, U+0082, and U+00B0, represented in UTF-8. You then
run utf8_to_latin1, and they get converted back.
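
To make the byte-level round trip concrete, here is a minimal sketch. It
assumes the mbstring extension is loaded; latin1_to_utf8() and
utf8_to_latin1() are hypothetical wrappers written for this illustration,
not existing PHP functions, and they use ISO-8859-1 as the Latin-1 label.

```php
<?php
// Hypothetical wrappers using the descriptive names from this message.
// Assumption: mbstring is available.
function latin1_to_utf8(string $s): string
{
    // Treat every input byte as one Latin-1 character and encode it as UTF-8.
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

function utf8_to_latin1(string $s): string
{
    // Decode UTF-8 and write each code point as a single Latin-1 byte.
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

$original = "\xE3\x82\xB0";             // the UTF-8 bytes of "グ" (U+30B0)
$encoded  = latin1_to_utf8($original);  // bytes misread as three Latin-1 characters

echo bin2hex($original), "\n";            // e382b0
echo bin2hex($encoded), "\n";             // c3a3c282c2b0
echo mb_strlen($encoded, 'UTF-8'), "\n";  // 3 (U+00E3, U+0082, U+00B0)
var_dump(utf8_to_latin1($encoded) === $original);  // bool(true)
```

The intermediate string is three unrelated characters encoded as six
bytes; the round trip only restores the original bytes because the second
call exactly undoes the first.
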
That code will never do anything useful.
Regards,
--
Rowan Tommins
[IMSoP]