Re: (1920) splitting by bytes in UTF-8

Stephane Chazelas via austin-group-l at The Open Group Mon, 28 Apr 2025 14:21:31 -0700

On 2025-04-28 20:35, Austin Group Issue Tracker via austin-group-l at The Open Group wrote:

[...]

 (0007158) hvd (reporter) - 2025-04-28 19:30
 https://www.austingroupbugs.net/view.php?id=1920#c7158
----------------------------------------------------------------------
That wouldn't be enough to accurately specify what shells do even if limited to UTF-8. Since it's now the explicit intent that variables may contain bytes that do not form valid characters, we have to ask what happens when IFS contains
bytes that do not form valid characters.
In UTF-8, é is encoded as 0xC3 0xA9. 0xA9 on its own is not a valid character.
But IFS can be set to 0xA9. If IFS is set to 0xA9

[...]

[replying here as I apparently can no longer post on the issue tracker]

My understanding is that the behaviour is only defined if $IFS contains text, and would be undefined if it contains byte sequences that cannot be decoded into text like with your 0xA9 example.

So a compliant application (sh script) has to make sure $IFS is text encoded in the locale's charmap.
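
One way a script could make that check is sketched below. It assumes a UTF-8 locale and an iconv that exits non-zero on invalid or incomplete input; the helper name is_valid_utf8 is made up for illustration, and for other charmaps the output of `locale charmap` would replace the hard-coded UTF-8.

```sh
# Hypothetical helper: succeeds if $1 decodes as UTF-8 text.
# Assumption: iconv exits non-zero on an invalid or incomplete sequence.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

ifs_candidate=$(printf '\251')   # lone 0xA9 byte, not a character in UTF-8
if is_valid_utf8 "$ifs_candidate"; then
  echo "usable as IFS"
else
  echo "not text in UTF-8"
fi
```

The same helper accepts the two-byte sequence 0xC3 0xA9 (é) and any ASCII string.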

Same applies to the delimiter passed to -d: must be single-byte (because that's all that's supported by current implementations) and that byte must be the encoding of a character per the user's charmap (so in UTF-8 locales, restricted to bytes 0 to 127). In UTF-8 locales (and I would think all self-synchronizing encodings), that makes sure characters are not cut in the middle, but not in all multibyte encodings in practice.
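
The reason an ASCII delimiter is safe in UTF-8 is that every byte of a multibyte UTF-8 character is 0x80 or above, so a byte in the 0 to 127 range can only be a complete character. A small sketch that makes this visible by dumping bytes (hex_bytes is a made-up helper name, and the sample string is arbitrary):

```sh
# Hypothetical helper: print stdin as a run of hex byte values.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
  echo
}

# "é:é" in UTF-8, written with octal escapes so the script itself stays ASCII
printf '\303\251:\303\251' | hex_bytes
# prints c3a93ac3a9
```

The only byte below 0x80 in the dump is 3a, the colon. By contrast, in Shift_JIS the katakana ソ is encoded as 0x83 0x5C, and its second byte has the same value as the ASCII backslash, which is the kind of encoding where a single-byte delimiter can land in the middle of a character.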

I'm fine with that. Trying to split on 0xA9 bytes in a UTF-8 locale makes little sense; you'd want to switch to a locale with a single-byte encoding, typically the C locale, which is the one I'd tend to use to work at byte level.
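
A minimal sketch of byte-level splitting along those lines, assuming a shell that applies a change of LC_ALL made at run time. The helper name count_fields_on_a9 and the sample data are made up for illustration.

```sh
# Hypothetical helper: count the fields obtained by splitting $1 on byte 0xA9.
# The body runs in a subshell so the LC_ALL, IFS and noglob changes do not leak.
count_fields_on_a9() (
  LC_ALL=C; export LC_ALL    # single-byte locale: work at byte level
  IFS=$(printf '\251')       # the 0xA9 byte, written as an octal escape
  set -f                     # no pathname expansion on the unquoted $1
  set -- $1                  # field splitting happens here
  echo "$#"
)

# "aéb" in UTF-8 is the four bytes 61 c3 a9 62
count_fields_on_a9 "$(printf 'a\303\251b')"
```

In the C locale this should report two fields, the first of which ends in a stray 0xC3 byte. That shows why the result is only meaningful when the data is treated as bytes rather than as UTF-8 text.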


--
Stephane
