On 13/12/2025 07:15, Collin Funk wrote:
* doc/coreutils.texi (dd invocation): Document the behavior of 'dd' on
multibyte characters and some unspecified behavior that will be
documented in a future POSIX release [1].
[1] https://austingroupbugs.net/view.php?id=1959
---
doc/coreutils.texi | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index d37cf2471..8ae81e110 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -9280,6 +9280,17 @@ @node dd invocation
The @samp{lcase} and @samp{ucase} conversions are mutually exclusive.
+@c https://austingroupbugs.net/view.php?id=1959
+POSIX leaves the behavior of @samp{lcase} and @samp{ucase} unspecified
+on multibyte characters. GNU @command{dd} only converts one byte at a
+time, because multibyte characters may cross block boundaries and case
+conversion may change the length of characters.
+
+POSIX also leaves the behavior of @samp{lcase} and @samp{ucase}
+unspecified if used with @samp{ascii}, @samp{ebcdic}, or @samp{ibm}.
+GNU @command{dd} will perform the case conversion and then perform the
+character set conversion.
+
@item sparse
@opindex sparse
Try to seek rather than write NUL output blocks.
Thanks for following up with the POSIX folks.
This clarification looks good and is worth making.
The dd interface never considered multibyte locales,
so it is best restricted to single-byte conversion IMHO.
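
For anyone wondering why character-aware conversion would be awkward,
here is a rough Python sketch of the two hazards the patch text names.
It is not dd's code: ucase_bytes and split_blocks are hypothetical
helpers, the ASCII-only mapping assumes something like the C locale,
and the 2-byte block size is chosen only to force a split inside a
character.

```python
# Illustration only: a simplified model, not GNU dd's implementation.

def ucase_bytes(block: bytes) -> bytes:
    """Uppercase ASCII letters one byte at a time; pass other bytes through."""
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in block)

def split_blocks(data: bytes, size: int) -> list[bytes]:
    """Cut the input into fixed-size blocks, as a block-oriented copier would."""
    return [data[i:i + size] for i in range(0, len(data), size)]

text = "a\u0131b"              # 'a', LATIN SMALL LETTER DOTLESS I, 'b'
data = text.encode("utf-8")    # the dotless i takes two bytes in UTF-8

# Hazard 1: with 2-byte blocks, the boundary falls inside the dotless i,
# so neither block holds the complete character.
blocks = split_blocks(data, 2)

# Byte-wise conversion works on each block independently and leaves the
# two non-ASCII bytes untouched.
bytewise = b"".join(ucase_bytes(blk) for blk in blocks)

# Hazard 2: character-aware uppercasing maps the dotless i to ASCII 'I',
# so the output is one byte shorter than the input.
charwise = text.upper().encode("utf-8")

print(blocks)
print(bytewise)
print(len(data), len(charwise))
```

Byte-at-a-time conversion keeps the output the same size as the input
and never needs state carried across blocks, which fits how dd works.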
cheers,
Padraig