> Shift_JIS is very ambiguous, What will we do if SJIS-2004 or SJIS-win comes?
> How do we guess(detect) SJIS-2004, SJIS-win and SJIS-mac?

I'm not the person you replied to in your previous email, but I thought I'd weigh in with what I can. My native language also uses a multi-byte encoding, and I have done a fair bit of conversion between character encodings.

The very reason we have multiple character encodings is that each one assigns the same byte values to different real-life characters, so converting from a non-UTF charset always carries some risk of detecting the wrong source encoding. As Yuya Hamada mentioned in the rest of the previous email, 0xFC40, for example, can map to two different characters. This is quite a common occurrence, and there is even a word (Mojibake) for it!

The most robust projects in this space are probably `enca` and `Chardet` (Python). In principle, though, every tool can only guess the text encoding by inspecting common patterns and by checking whether all bytes map to a meaningful glyph. When there is not much text to inspect, these tools are very prone to guessing wrong.

When the source encoding is correctly detected or already known, it's easy to re-encode files using `iconv`, followed by a quick `sed` to remove the `declare()` calls.

---

That said, I'm hugely in favor of dropping support for non-UTF8 encodings. Because the source encoding is present in the INI settings or the declare statement, site owners should be able to mass-convert their text to UTF-8. Many languages like Rust only support UTF-8 (https://doc.rust-lang.org/reference/input-format.html), and I don't think any new PHP developers will expect PHP to work with non-UTF8 encodings in the first place.
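
To make the detection problem and the re-encoding step above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how `enca` or `Chardet` work internally. The helper names `decode_candidates` and `to_utf8` are made up for this example, and I am assuming that Python's `cp932` codec is a reasonable stand-in for SJIS-win and `shift_jis_2004` for SJIS-2004.

```python
from typing import Dict, Iterable, Optional

# Python codec names standing in for the variants discussed in this thread
# (assumed mapping: cp932 ~ SJIS-win, shift_jis_2004 ~ SJIS-2004).
CANDIDATES = ("shift_jis", "cp932", "shift_jis_2004")


def decode_candidates(
    data: bytes, encodings: Iterable[str] = CANDIDATES
) -> Dict[str, Optional[str]]:
    """Strictly decode `data` with each codec; None marks invalid input."""
    results: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)  # errors="strict" is the default
        except UnicodeDecodeError:
            results[name] = None
    return results


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode to UTF-8 once the source encoding is known (iconv's job)."""
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # Print code points via ascii() so the output is terminal-independent.
    for sample in (b"\x81\x60", b"\x87\x40", b"\xfc\x40"):
        for name, text in decode_candidates(sample).items():
            print(sample.hex(), name, ascii(text))
```

The bytes 0x8160 decode without error under both `shift_jis` and `cp932`, but to different code points (U+301C and U+FF5E), so "does it decode cleanly?" cannot pick the right variant on its own. The bytes 0x8740 are rejected by `shift_jis` yet accepted by `cp932`, which is the kind of weak signal a detector has to rely on when the input is short.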