Re: uniq: support multi-byte characters with --ignore-case

Collin Funk Sat, 18 Oct 2025 07:33:00 -0700

Collin Funk <[email protected]> writes:

> We discussed this patch off list and are going to leave it for a future
> release. But I figured I would post it here for others to try and so I
> do not lose it.
>
> The patch handles multi-byte characters when invoking
> 'uniq --ignore-case' while preserving performance under LC_ALL=C
> and when --ignore-case is not given.
>
>     $ yes abcdefghijklmnopqrstuvwxyz | head -n 10000000 > test.txt
>
>     $ export LC_ALL=en_US.UTF-8
>     $ time ./src/uniq-new test.txt
>     real      0m0.420s
>     $ time ./src/uniq-new --ignore-case test.txt
>     real      0m0.761s
>
>     $ export LC_ALL=C
>     $ time ./src/uniq-new test.txt
>     real      0m0.425s
>     $ time ./src/uniq-new --ignore-case test.txt
>     real      0m0.485s
>
>     $ export LC_ALL=en_US.UTF-8
>     $ time ./src/uniq-old test.txt
>     real      0m0.420s
>     $ time ./src/uniq-old --ignore-case test.txt
>     real      0m0.437s
>
>     $ export LC_ALL=C
>     $ time ./src/uniq-old test.txt
>     real      0m0.416s
>     $ time ./src/uniq-old --ignore-case test.txt
>     real      0m0.626s


Okay to push this after running 'sed s/framework_failure/&_/' on the
test to fix syntax-check, and adding a NEWS entry?

This patch should be the only thing needed for 'uniq' to handle
multi-byte characters. The only delimiters used are '\n' and '\0',
which cannot occur as a byte inside a multi-byte character (assuming a
sane encoding). Therefore the linebuffer.h functions still work and
are efficient.
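
To make the delimiter argument concrete for UTF-8, here is a small
sketch. The function names are made up for illustration and this is
not how linebuffer.h is implemented; it only shows that splitting on a
single delimiter byte needs no decoding, because every byte of a
multi-byte UTF-8 character is 0x80 or above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the records in BUF[0..LEN) that are
   terminated by the single byte DELIM.  Only bytes are examined; no
   multi-byte decoding is done.  A final record without a trailing
   delimiter is counted too.  */
size_t
count_records (const char *buf, size_t len, char delim)
{
  size_t n = 0;
  const char *p = buf;
  const char *end = buf + len;

  while (p < end)
    {
      const char *q = memchr (p, delim, end - p);
      n++;
      if (!q)
        break;                  /* Last record, no trailing delimiter.  */
      p = q + 1;
    }
  return n;
}

/* Hypothetical helper: in UTF-8, lead and continuation bytes of a
   multi-byte character are all >= 0x80, so neither '\n' (0x0A) nor
   '\0' (0x00) can be one of them.  */
int
is_utf8_multibyte_byte (unsigned char b)
{
  return b >= 0x80;
}
```

Other encodings would need their own check, which is what the "sane
encoding" caveat covers.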

Collin
