> Shift_JIS is very ambiguous, What will we do if SJIS-2004 or SJIS-win comes?
> How do we guess(detect) SJIS-2004, SJIS-win and SJIS-mac?

I'm not the person you replied to in your previous email, but I
thought I'd weigh in with what I can. My native language also uses a
multi-byte encoding, and I have done a fair bit of conversion from one
character encoding to another.

The very reason we have multiple character encodings is to be able to
reassign the same byte values to different real-life characters, so
converting from a non-UTF charset always carries some risk of
detecting the wrong source text encoding. As Yuya Hamada mentioned in
the rest of the previous email, 0xFC40, for example, can map to two
different characters. These collisions are quite common, and there is
even a word (Mojibake) for the garbled text they produce!
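
To make the point about identical bytes and different characters
concrete, here is a rough sketch in Python (standard library only).
The helper name `decode_under` is made up for illustration, and the
codec names are Python's own, which only approximate the SJIS-2004 and
SJIS-win variants discussed here. I am deliberately not claiming which
characters come out for 0xFC40; run it and compare.

```python
from typing import Dict, List, Optional


def decode_under(raw: bytes, encodings: List[str]) -> Dict[str, Optional[str]]:
    """Decode the same bytes under each encoding; None means 'invalid there'."""
    results: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)  # strict mode: bad bytes raise
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    # One byte, a different letter in each single-byte charset.
    print(decode_under(b"\xe9", ["latin-1", "cp1251"]))
    # The 0xFC40 case, under the Shift_JIS variants Python ships with.
    variants = ["shift_jis", "cp932", "shift_jis_2004"]
    for name, text in decode_under(b"\xfc\x40", variants).items():
        print(name, repr(text))
```

Nothing in the bytes themselves says which of these readings is the
intended one; that information has to come from somewhere else.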

The most robust projects in this space are probably `enca` and
`chardet` (Python). However, in principle, all tools can only guess
the text encoding by inspecting common patterns and by checking
whether all bytes map to a meaningful character. When there is not a
lot of text to inspect, these tools are very prone to guessing wrong.
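
A minimal sketch of the "do all bytes map to something" half of that
guessing, again in Python with only the standard library. This is not
how `enca` or `chardet` are implemented; `plausible_encodings` is a
hypothetical helper, and real detectors also weigh how likely the
decoded text looks.

```python
from typing import Iterable, List


def plausible_encodings(raw: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidates under which `raw` decodes without any error."""
    plausible = []
    for name in candidates:
        try:
            raw.decode(name)  # strict decode; we only care whether it succeeds
        except UnicodeDecodeError:
            continue
        plausible.append(name)
    return plausible


if __name__ == "__main__":
    sample = "日本".encode("cp932")  # only four bytes of input
    candidates = ["utf-8", "euc_jp", "shift_jis", "cp932", "latin-1"]
    print(plausible_encodings(sample, candidates))
```

With a sample this short, more than one candidate survives the check,
so the tool still has to pick one, and it can pick wrong.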

When the source encoding is correctly detected or known, it's easy to
re-encode files using `iconv`, followed by a quick `sed` to remove the
`declare()` calls.
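
For anyone who would rather script it than chain `iconv` and `sed`,
the same two steps could look roughly like this in Python. It assumes
the source encoding is already known, and the regular expression is
only my guess at what a typical `declare(encoding=...)` line looks
like; it will not handle a `declare()` that carries other directives
as well. `convert_source` is a made-up name.

```python
import re

# Assumed shape of the statement to drop, e.g. declare(encoding='Shift_JIS');
_DECLARE_ENCODING = re.compile(
    r"declare\s*\(\s*encoding\s*=\s*(['\"])[^'\"]+\1\s*\)\s*;[ \t]*\r?\n?",
    re.IGNORECASE,
)


def convert_source(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode PHP source bytes to UTF-8 and drop declare(encoding=...)."""
    text = raw.decode(source_encoding)  # raises if the stated encoding is wrong
    text = _DECLARE_ENCODING.sub("", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    original = "<?php\ndeclare(encoding='Shift_JIS');\necho '日本';\n"
    print(convert_source(original.encode("cp932"), "cp932").decode("utf-8"))
```

Decoding in strict mode is intentional: if the stated encoding is
wrong, it is better for the conversion to fail than to write Mojibake
into the file.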

---

That said, I'm hugely in favor of dropping support for non-UTF-8
encodings. Because the source encoding is already stated in the INI
settings or the declare statement, site owners should be able to
mass-convert their files to UTF-8. Many languages, such as Rust, only
accept UTF-8 source files
(https://doc.rust-lang.org/reference/input-format.html), and I don't
think any new PHP developers will expect PHP to work with non-UTF-8
encodings in the first place.

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
