On 2025-04-28 20:35, Austin Group Issue Tracker via austin-group-l at
The Open Group wrote:
[...]
(0007158) hvd (reporter) - 2025-04-28 19:30
https://www.austingroupbugs.net/view.php?id=1920#c7158
----------------------------------------------------------------------
That wouldn't be enough to accurately specify what shells do even if
limited to UTF-8. Since it's now the explicit intent that variables may
contain bytes that do not form valid characters, we have to ask what
happens when IFS contains bytes that do not form valid characters.
In UTF-8, é is encoded as 0xC3 0xA9. 0xA9 on its own is not a valid
character. But IFS can be set to 0xA9. If IFS is set to 0xA9
[...]
[replying here as I apparently can no longer post on the issue tracker]
My understanding is that the behaviour is only defined if $IFS contains
text, and is undefined if it contains byte sequences that cannot be
decoded into text, as in your 0xA9 example.
So a compliant application (sh script) has to make sure $IFS is text
encoded in the locale's charmap.
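
To make the "is text" requirement above concrete, here is a minimal
sketch in Python rather than sh. It assumes the locale's charmap is
UTF-8 and uses Python's strict decoder as a stand-in for the locale's
decoding, which may differ in corner cases. The helper name is_text is
hypothetical and only for illustration.

```python
def is_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte of data decodes to characters in encoding.

    Assumption: Python's strict decoder approximates what "text in the
    locale's charmap" means; a real shell would rely on the C library.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# 0xC3 0xA9 is the UTF-8 encoding of é, so this value would be text.
print(is_text(b"\xc3\xa9"))

# 0xA9 alone is a continuation byte without a lead byte, so not text.
print(is_text(b"\xa9"))
```

Under that reading, an $IFS value for which such a check fails would be
outside what the specification defines.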
The same applies to the delimiter passed to -d: it must be a single byte
(because that's all current implementations support) and that byte must
be the encoding of a character in the user's charmap (so in UTF-8
locales, restricted to bytes 0 to 127). In UTF-8 locales (and, I would
think, in all self-synchronizing encodings), that guarantees characters
are not cut in the middle, but in practice that does not hold for all
multibyte encodings.
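
As a rough illustration of that difference, the sketch below (Python,
not sh) looks for characters whose encoding contains a byte below 0x80
after the first byte; a single-byte delimiter with that value would cut
such a character in the middle. Shift_JIS and the character U+30BD are
my own choice of example, and the helper name is hypothetical.

```python
def chars_cut_by_ascii_delimiter(text: str, encoding: str) -> list:
    """Return the characters of text whose encoding in the given charset
    has a byte below 0x80 somewhere after the first byte."""
    cut = []
    for ch in text:
        encoded = ch.encode(encoding)
        # Any trailing byte in the 0..127 range could be mistaken for a
        # single-byte delimiter such as backslash (0x5C).
        if any(b < 0x80 for b in encoded[1:]):
            cut.append(ch)
    return cut


# UTF-8: every byte of a multibyte character is 0x80 or above.
print(chars_cut_by_ascii_delimiter("\u00e9\u30bd", "utf-8"))

# Shift_JIS (assumed example): U+30BD is encoded as 0x83 0x5C.
print(chars_cut_by_ascii_delimiter("\u30bd", "shift_jis"))
```

In a locale using such an encoding, 0x5C is the encoding of a character
on its own, yet splitting on it can still break a two-byte character.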
I'm fine with that. Trying to split on 0xA9 bytes in a UTF-8 locale
makes little sense; you'd want to switch to a locale with a single-byte
encoding, typically the C locale, which is the one I'd tend to use to
work at byte level.
--
Stephane