Pádraig Brady writes:
> On 13/12/2025 22:40, Grisha Levit wrote:
>> On Sat, Dec 13, 2025, 02:16 Collin Funk wrote:
>>>
>>> +@c https://austingroupbugs.net/view.php?id=1959
>>> +POSIX leaves the behavior of @samp{lcase} and @samp{ucase} unspecified
>>> +on multibyte characters. GNU @command{dd} only converts one byte at a
>>> +time,
>> I wonder if it may be ambiguous if "converts one byte at a time" means
>> "reads one byte and converts it" or "reads one byte and converts it to
>> one byte". This seems to leave open the possibility that the "i" will
>> be converted in something like:
>> $ LC_ALL=tr_TR.utf8 dd conv=ucase <<< hij
>> HiJ
>>
>>> because multibyte characters may cross block boundaries and case
>>> +conversion may change the length of characters.
>> But, OTOH, the meaning might be obvious enough from the rationale.
>
> It's a bit of a stretch to think it may output multibyte,
> but I do see the ambiguity. Perhaps this is better:
>
> +POSIX leaves the behavior of @samp{lcase} and @samp{ucase} unspecified
> +on multibyte characters. GNU @command{dd} supports only unibyte conversion,
> +because multibyte characters may cross block boundaries and case
> +conversion may change the length of characters.

Thank you both for the help. I pushed that proposed wording.

I wasn't a fan of the wording in my original patch, but couldn't think
of anything better at the time.

> This might better allude to the fact we fully support
> conversion in any unibyte locale, as seen in:
>
> $ printf '%q\n' $(printf 'hi' | LC_ALL=tr_TR dd status=none conv=ucase)
> $'H\335'
Yep, agreed.
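
For contrast, here is a rough sketch of what "unibyte conversion" means for
multibyte input. It assumes GNU dd (for status=none) and forces the C locale
so the result does not depend on which locales are installed. The input is
'h' followed by the two-byte UTF-8 encoding of 'é' (c3 a9). Only the ASCII
byte is converted, and the two bytes of the multibyte character should pass
through untouched:

```shell
# Upper-case 'h' + UTF-8 'é' (bytes c3 a9) one byte at a time, then
# dump the result as hex.  LC_ALL=C is an assumption made here so the
# example does not depend on any particular locale being installed.
out=$(printf 'h\303\251' \
      | LC_ALL=C dd conv=ucase status=none \
      | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$out"   # prints 48c3a9
```

In a unibyte locale such as the tr_TR one quoted above, bytes above 0x7f may
be converted too, but each byte is still handled on its own.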
I also thought it was worth making clear that we support:

    $ printf '%s\n' \
        $(printf 'abc' | dd conv=ucase,ebcdic status=none \
            | iconv -f EBCDIC-US -t ASCII)
    ABC

The POSIX issue makes the behavior there unspecified.

Collin