It might be possible to define a single UTF locale that converts between UTF-8
and UTF-32 as needed. Issues like this could then be resolved by referring to
characters. A shortcoming of the C++ std::string model is that it in effect
treats these encodings as if they belonged to different character sets.

I have written conversions between UTF-8 and UTF-32 that extend to the full
32-bit range, so that every char32_t value can be converted, and that report
UTF-8 errors in such a way that the original byte sequence can be recovered.
This allows different kinds of error handling to be built on top.
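
As a rough illustration of the error-reporting idea only (this is not my
actual code, and it does not show the extension to the full 32-bit range), a
decoder can return a list of units where each unit is either a code point or a
single undecodable byte, and always keeps the exact bytes it was read from.
The names Utf8Unit, decode_utf8 and reassemble are made up for this sketch,
and it accepts only standard UTF-8 up to U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One decoded unit: either a code point or a single undecodable byte.
// `bytes` always holds the exact input bytes the unit was read from.
struct Utf8Unit {
    bool valid;
    char32_t code_point;  // meaningful only when valid is true
    std::string bytes;
};

// Hypothetical helper: strict decoding of standard UTF-8 (no overlong forms,
// no surrogates, nothing above U+10FFFF). Any byte that cannot start a valid
// sequence is reported as its own error unit and decoding resumes after it.
std::vector<Utf8Unit> decode_utf8(const std::string& in) {
    std::vector<Utf8Unit> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;         // 0 means "not a valid lead byte"
        char32_t cp = 0, lowest = 0; // lowest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        lowest = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        ok = ok && cp >= lowest && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            out.push_back({true, cp, in.substr(i, len)});
            i += len;
        } else {
            out.push_back({false, 0, in.substr(i, 1)});
            i += 1;
        }
    }
    return out;
}

// Concatenating the stored bytes gives back the original input unchanged.
std::string reassemble(const std::vector<Utf8Unit>& units) {
    std::string out;
    for (const Utf8Unit& u : units) out += u.bytes;
    return out;
}

// Usage: the byte string from the quoted note (é, stray 0xA9, @).
void round_trip_example() {
    const std::string input = "\xC3\xA9\xA9@";
    const std::vector<Utf8Unit> units = decode_utf8(input);
    assert(units.size() == 3);
    assert(units[0].valid && units[0].code_point == 0xE9);
    assert(!units[1].valid && units[1].bytes == "\xA9");
    assert(reassemble(units) == input);
}
```

Because the error units carry their bytes, a caller can choose to replace
them, reject the input, or pass them through untouched.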

> ---------------------------------------------------------------------- 
> (0007158) hvd (reporter) - 2025-04-28 19:30
> https://www.austingroupbugs.net/view.php?id=1920#c7158 
> ---------------------------------------------------------------------- 
> That wouldn't be enough to accurately specify what shells do even if limited
> to UTF-8. Since it's now the explicit intent that variables may contain bytes
> that do not form valid characters, we have to ask what happens when IFS
> contains bytes that do not form valid characters.
> 
> In UTF-8, é is encoded as 0xC3 0xA9. 0xA9 on its own is not a valid character.
> But IFS can be set to 0xA9. If IFS is set to 0xA9, and X is set to 0xC3 0xA9
> 0xA9 0x40 (é, invalid byte, @), then in most locale-aware shells that I know
> of that permit arbitrary bytes in variables (bash, gwsh, bosh, ksh), $X is
> split into two fields, the first one being 0xC3 0xA9, the second one being @.
> Most shells do not do any pure byte-based splitting. Exceptions are mksh which
> does appear to do exactly that (producing 0xC3, empty, 0x40), and zsh which
> does not split at all on this case.
> 
> Clearly the current wording is defective. A long time ago I wrote on the
> mailing list in more detail about what shells actually did with variables
> containing bytes that do not form valid characters in the context of pattern
> matching (subject: "[Issue 8 drafts 0001564]: clariy on what (character/byte)
> strings pattern matching notation should work") and asked whether there was
> any interest in getting this standardized. There was no interest then. Given
> the mess that we have now ended up with, please now actually look at what
> shells do, and specify that, rather than coming up with more broken specs that
> only handle the trivial cases.
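
The two splitting outcomes described in the quoted note can be reproduced with
a small model. This is only a sketch: it is not how bash, ksh or mksh are
implemented, it handles a single non-whitespace delimiter, it does not model
the zsh behaviour, and unit_length is a simplified check (it does not reject
overlong forms). The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length of the UTF-8 sequence starting at in[i], or 1 if the byte there
// cannot start a complete sequence (so it is treated as a unit on its own).
std::size_t unit_length(const std::string& in, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    const std::size_t len = b0 < 0x80            ? 1
                          : (b0 & 0xE0) == 0xC0 ? 2
                          : (b0 & 0xF0) == 0xE0 ? 3
                          : (b0 & 0xF8) == 0xF0 ? 4
                                                : 0;
    if (len == 0 || i + len > in.size()) return 1;
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(in[i + k]);
        if ((b & 0xC0) != 0x80) return 1;  // not a continuation byte
    }
    return len;
}

// Split x on delim. With byte_wise == false the input is walked one
// character (or stray byte) at a time; with byte_wise == true it is walked
// one byte at a time. Each delimiter ends the current field; a non-empty
// tail becomes the last field.
std::vector<std::string> split_fields(const std::string& x,
                                      const std::string& delim,
                                      bool byte_wise) {
    std::vector<std::string> fields;
    std::string current;
    for (std::size_t i = 0; i < x.size();) {
        const std::size_t step = byte_wise ? 1 : unit_length(x, i);
        if (x.compare(i, step, delim) == 0) {
            fields.push_back(current);
            current.clear();
        } else {
            current.append(x, i, step);
        }
        i += step;
    }
    if (!current.empty()) fields.push_back(current);
    return fields;
}

// Usage: IFS = 0xA9, X = 0xC3 0xA9 0xA9 0x40, as in the quoted note.
void ifs_example() {
    const std::string x = "\xC3\xA9\xA9@";
    const std::string ifs = "\xA9";

    // Character-aware walk: two fields, é and @.
    const std::vector<std::string> by_char = split_fields(x, ifs, false);
    assert(by_char.size() == 2);
    assert(by_char[0] == "\xC3\xA9" && by_char[1] == "@");

    // Byte-wise walk: three fields, 0xC3, empty, @.
    const std::vector<std::string> by_byte = split_fields(x, ifs, true);
    assert(by_byte.size() == 3);
    assert(by_byte[0] == "\xC3" && by_byte[1].empty() && by_byte[2] == "@");
}
```

The only difference between the two results is whether the 0xA9 inside é is
visible to the splitter as a separate unit.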


