It might be possible to define a locale, say UTF, that converts as needed between UTF-8 and UTF-32. Issues like this could then be resolved by referring to characters rather than bytes. A shortcoming of the C++ std::string model is that it in effect assumes these encodings belong to different character sets.
I have made conversions between UTF-8 and UTF-32 that extend to the full 32-bit range, so that all char32_t values can be converted, and that report UTF-8 errors in such a way that the original byte sequence can be recovered, which allows different kinds of error handling to be built on top.

> ----------------------------------------------------------------------
> (0007158) hvd (reporter) - 2025-04-28 19:30
> https://www.austingroupbugs.net/view.php?id=1920#c7158
> ----------------------------------------------------------------------
> That wouldn't be enough to accurately specify what shells do even if
> limited to UTF-8. Since it's now the explicit intent that variables may
> contain bytes that do not form valid characters, we have to ask what
> happens when IFS contains bytes that do not form valid characters.
>
> In UTF-8, é is encoded as 0xC3 0xA9. 0xA9 on its own is not a valid
> character. But IFS can be set to 0xA9. If IFS is set to 0xA9, and X is
> set to 0xC3 0xA9 0xA9 0x40 (é, invalid byte, @), then in most
> locale-aware shells that I know of that permit arbitrary bytes in
> variables (bash, gwsh, bosh, ksh), $X is split into two fields, the
> first one being 0xC3 0xA9, the second one being @. Most shells do not
> do any pure byte-based splitting. Exceptions are mksh which does appear
> to do exactly that (producing 0xC3, empty, 0x40), and zsh which does
> not split at all on this case.
>
> Clearly the current wording is defective. A long time ago I wrote on
> the mailing list in more detail about what shells actually did with
> variables containing bytes that do not form valid characters in the
> context of pattern matching (subject: "[Issue 8 drafts 0001564]: clariy
> on what (character/byte) strings pattern matching notation should
> work") and asked whether there was any interest in getting this
> standardized. There was no interest then. Given the mess that we have
> now ended up with, please now actually look at what shells do, and
> specify that, rather than coming up with more broken specs that only
> handle the trivial cases.
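
To make the splitting example in the quoted note concrete, here is a minimal C++ sketch of the two behaviours it describes. It is an illustration under stated assumptions, not what any of these shells actually implement: the names utf8_len, to_units, split_by_units and split_by_bytes are hypothetical, only standard UTF-8 is recognised (not the extended 32-bit form mentioned above), and the usual IFS rules for whitespace and trailing delimiters are ignored. A byte that does not form a valid character is kept as a one-byte unit, so concatenating the units always gives back the original byte sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length of the well-formed UTF-8 sequence starting at s[i],
// or 0 if the byte at s[i] does not begin one.
std::size_t utf8_len(const std::string& s, std::size_t i) {
    assert(i < s.size());
    auto b = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    auto cont = [&](std::size_t k) {
        return k < s.size() && (b(k) & 0xC0) == 0x80;
    };
    const unsigned char c = b(i);
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return cont(i + 1) ? 2 : 0;
    if (c >= 0xE0 && c <= 0xEF) {
        if (!cont(i + 1) || !cont(i + 2)) return 0;
        if (c == 0xE0 && b(i + 1) < 0xA0) return 0;  // overlong form
        if (c == 0xED && b(i + 1) > 0x9F) return 0;  // surrogate range
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        if (!cont(i + 1) || !cont(i + 2) || !cont(i + 3)) return 0;
        if (c == 0xF0 && b(i + 1) < 0x90) return 0;  // overlong form
        if (c == 0xF4 && b(i + 1) > 0x8F) return 0;  // above U+10FFFF
        return 4;
    }
    return 0;  // stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF
}

// Cut s into units: each unit is either one valid character or one
// invalid byte. Concatenating the units reproduces s exactly.
std::vector<std::string> to_units(const std::string& s) {
    std::vector<std::string> units;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t n = utf8_len(s, i);
        if (n == 0) n = 1;  // invalid byte: keep it as a unit of its own
        units.push_back(s.substr(i, n));
        i += n;
    }
    return units;
}

// Character-aware splitting: a delimiter only matches a whole unit,
// never a byte inside a valid multi-byte character.
std::vector<std::string> split_by_units(const std::string& s,
                                        const std::string& delim_unit) {
    std::vector<std::string> fields(1);
    for (const std::string& u : to_units(s)) {
        if (u == delim_unit) fields.emplace_back();
        else fields.back() += u;
    }
    return fields;
}

// Pure byte-based splitting: every occurrence of the delimiter byte
// ends a field, even in the middle of a valid character.
std::vector<std::string> split_by_bytes(const std::string& s, char delim) {
    std::vector<std::string> fields(1);
    for (char c : s) {
        if (c == delim) fields.emplace_back();
        else fields.back() += c;
    }
    return fields;
}
```

With X set to the bytes 0xC3 0xA9 0xA9 0x40, split_by_units(X, "\xA9") gives the two fields 0xC3 0xA9 and @, as the note reports for bash, gwsh, bosh and ksh, while split_by_bytes(X, '\xA9') gives 0xC3, empty, @, as reported for mksh.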