Pádraig Brady <[email protected]> writes:

> On 13/12/2025 22:40, Grisha Levit wrote:
>> On Sat, Dec 13, 2025, 02:16 Collin Funk <[email protected]> wrote:
>>>
>>> +@c https://austingroupbugs.net/view.php?id=1959
>>> +POSIX leaves the behavior of @samp{lcase} and @samp{ucase} unspecified
>>> +on multibyte characters.  GNU @command{dd} only converts one byte at a
>>> +time,
>> I wonder if it may be ambiguous if "converts one byte at a time" means
>> "reads one byte and converts it" or "reads one byte and converts it to
>> one byte".  This seems to leave open the possibility that the "i" will
>> be converted in something like:
>>      $ LC_ALL=tr_TR.utf8 dd conv=ucase <<< hij
>>      HiJ
>> 
>>> because multibyte characters may cross block boundaries and case
>>> +conversion may change the length of characters.
>> But, OTOH, the meaning might be obvious enough from the rationale.
>
> It's a bit of a stretch to think it may output multibyte,
> but I do see the ambiguity.  Perhaps this is better:
>
> +POSIX leaves the behavior of @samp{lcase} and @samp{ucase} unspecified
> +on multibyte characters.  GNU @command{dd} supports only unibyte conversion,
> +because multibyte characters may cross block boundaries and case
> +conversion may change the length of characters.

Thank you both for the help. I pushed that proposed wording.

I wasn't a fan of the wording in my original patch, but couldn't think
of anything better at the time.

> This might better allude to the fact we fully support
> conversion in any unibyte locale, as seen in:
>
>   $ printf '%q\n' $(printf 'hi' | LC_ALL=tr_TR dd status=none conv=ucase)
>   $'H\335'

Yep, agreed.
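
For anyone following along, here is a minimal sketch of what byte-wise
conversion means for multibyte input. It assumes GNU dd (for status=none)
and the C locale; ucase_bytes is a hypothetical helper name used only for
illustration, not something in coreutils.

```shell
# Hypothetical helper: upper-case stdin one byte at a time with GNU dd.
# LC_ALL=C is an assumption made so the result does not depend on the
# caller's locale.
ucase_bytes() {
    LC_ALL=C dd conv=ucase status=none
}

# Plain ASCII letters are converted as expected.
ascii_out=$(printf 'abc' | ucase_bytes)
printf '%s\n' "$ascii_out"    # ABC

# "h" followed by UTF-8 e-acute (bytes 0xc3 0xa9).  Only the single-byte
# "h" has an upper-case form in the C locale; the two bytes of the
# multibyte character pass through unchanged.
hex_out=$(printf 'h\303\251' | ucase_bytes | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex_out"      # 48c3a9
```

The od dump is there only to make the untouched bytes visible; the output
length always equals the input length, which is what "unibyte conversion"
is meant to convey.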

I also thought it was worth making clear that we support:

    $ printf '%s\n' \
        $(printf 'abc' | dd conv=ucase,ebcdic status=none \
            | iconv -f EBCDIC-US -t ASCII)
    ABC

The POSIX issue makes the behavior there unspecified.

Collin
