A NOTE has been added to this issue.
======================================================================
https://www.austingroupbugs.net/view.php?id=1920
======================================================================
Reported By:                stephane
Assigned To:
======================================================================
Project:                    1003.1(2013)/Issue7+TC1
Issue ID:                   1920
Category:                   Shell and Utilities
Type:                       Omission
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Stephane Chazelas
Organization:
User Reference:
Section:                    read utility, stdin section
Page Number:                3321
Line Number:                112915
Interp Status:              ---
Final Accepted Text:
======================================================================
Date Submitted:             2025-04-21 07:16 UTC
Last Modified:              2025-04-28 19:30 UTC
======================================================================
Summary:                    read -d '' on invalid text without -r and IFS=
======================================================================

----------------------------------------------------------------------
 (0007158) hvd (reporter) - 2025-04-28 19:30
 https://www.austingroupbugs.net/view.php?id=1920#c7158
----------------------------------------------------------------------
That wouldn't be enough to accurately specify what shells do even if limited to UTF-8.

Since it's now the explicit intent that variables may contain bytes that do not form valid characters, we have to ask what happens when IFS contains bytes that do not form valid characters. In UTF-8, é is encoded as 0xC3 0xA9. 0xA9 on its own is not a valid character. But IFS can be set to 0xA9.

If IFS is set to 0xA9, and X is set to 0xC3 0xA9 0xA9 0x40 (é, invalid byte, @), then in most locale-aware shells that I know of that permit arbitrary bytes in variables (bash, gwsh, bosh, ksh), $X is split into two fields, the first one being 0xC3 0xA9, the second one being @. Most shells do not do any pure byte-based splitting. The exceptions are mksh, which does appear to do exactly that (producing 0xC3, empty, 0x40), and zsh, which does not split at all in this case. Clearly the current wording is defective.

A long time ago I wrote on the mailing list in more detail about what shells actually did with variables containing bytes that do not form valid characters in the context of pattern matching (subject: "[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work") and asked whether there was any interest in getting this standardized. There was no interest then. Given the mess that we have now ended up with, please now actually look at what shells do, and specify that, rather than coming up with more broken specs that only handle the trivial cases.
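
To make the two outcomes described in the note concrete, here is a minimal sketch in Python that models them. It is an illustration only: the function names split_bytewise and split_charwise are hypothetical, neither is taken from any shell's source, and the character-aware variant is just one plausible way to arrive at the two-field result reported for bash, gwsh, bosh and ksh. The zsh behaviour (no splitting at all) is not modelled.

```python
# Hypothetical model of two field-splitting strategies for IFS=0xA9.
# Not derived from any shell's implementation.

def split_bytewise(value: bytes, ifs: bytes) -> list:
    """Split on every occurrence of the IFS byte, ignoring character boundaries."""
    return value.split(ifs)


def split_charwise(value: bytes, ifs: bytes) -> list:
    """Decode as UTF-8 first, so a byte inside a valid character never matches IFS."""
    # surrogateescape maps each undecodable byte to a lone surrogate and back,
    # so invalid bytes survive the round trip unchanged.
    text = value.decode("utf-8", "surrogateescape")
    sep = ifs.decode("utf-8", "surrogateescape")
    return [field.encode("utf-8", "surrogateescape") for field in text.split(sep)]


IFS = b"\xa9"             # a lone continuation byte, not a valid character
X = b"\xc3\xa9\xa9\x40"   # é, invalid byte, @

print(split_bytewise(X, IFS))  # [b'\xc3', b'', b'@']
print(split_charwise(X, IFS))  # [b'\xc3\xa9', b'@']
```

The byte-wise result matches the three fields reported for mksh, and the character-aware result matches the two fields reported for most other shells. Python's split() is not a full model of shell field splitting (for example, it keeps a trailing empty field that shells normally drop), so the sketch should only be read for this one input.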

Issue History
Date Modified    Username       Field                    Change
======================================================================
2025-04-21 07:16 stephane       New Issue
2025-04-21 07:30 stephane       Note Added: 0007139
2025-04-21 07:38 stephane       Note Edited: 0007139
2025-04-22 14:46 geoffclare     Note Added: 0007140
2025-04-22 15:20 hvd            Note Added: 0007141
2025-04-22 18:45 chet_ramey     Note Added: 0007142
2025-04-23 14:59 dwheeler       Note Added: 0007143
2025-04-23 16:47 stephane       Note Added: 0007144
2025-04-23 16:57 stephane       Note Added: 0007145
2025-04-23 19:53 dwheeler       Note Added: 0007146
2025-04-24 10:54 geoffclare     Note Added: 0007147
2025-04-24 10:55 geoffclare     Note Edited: 0007147
2025-04-24 11:00 geoffclare     Note Added: 0007148
2025-04-24 11:23 stephane       Note Added: 0007149
2025-04-24 12:18 stephane       Note Added: 0007150
2025-04-24 15:43 dwheeler       Note Added: 0007152
2025-04-27 08:51 stephane       Note Added: 0007155
2025-04-28 18:37 stephane       Note Added: 0007156
2025-04-28 19:30 hvd            Note Added: 0007158
======================================================================