Collin Funk <[email protected]> writes:

> We discussed this patch off list and are going to leave it for a future
> release. But I figured I would post it here for others to try and so I
> do not lose it.
>
> The patch handles multi-byte characters when invoking
> 'uniq --ignore-case' while preserving performance in the case of
> LC_ALL=C and the case without --ignore-case.
>
>     $ yes abcdefghijklmnopqrstuvwxyz | head -n 10000000 > test.txt
>
>     $ export LC_ALL=en_US.UTF-8
>     $ time ./src/uniq-new test.txt
>     real 0m0.420s
>     $ time ./src/uniq-new --ignore-case test.txt
>     real 0m0.761s
>
>     $ export LC_ALL=C
>     $ time ./src/uniq-new test.txt
>     real 0m0.425s
>     $ time ./src/uniq-new --ignore-case test.txt
>     real 0m0.485s
>
>     $ export LC_ALL=en_US.UTF-8
>     $ time ./src/uniq-old test.txt
>     real 0m0.420s
>     $ time ./src/uniq-old --ignore-case test.txt
>     real 0m0.437s
>
>     $ export LC_ALL=C
>     $ time ./src/uniq-old test.txt
>     real 0m0.416s
>     $ time ./src/uniq-old --ignore-case test.txt
>     real 0m0.626s

Okay to push this after applying 'sed s/framework_failure/&_/' to the
test (to fix syntax-check) and adding a NEWS entry?

This patch should be the only thing needed for 'uniq' to handle
multi-byte characters. The only delimiters used are '\n' and '\0',
neither of which can occur inside a multi-byte character (assuming a
sane encoding). Therefore the linebuffer.h functions keep working
unchanged and remain efficient.

Collin
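
For anyone who wants to convince themselves of the delimiter argument,
here is a minimal sketch for the UTF-8 case only. It is not part of the
patch; the helper names utf8_encode and count_delimiter_collisions are
made up for illustration. It encodes every code point from U+0080 to
U+10FFFF (surrogates included, for simplicity) and counts the multi-byte
sequences that contain a '\n' or '\0' byte. Other encodings are not
covered, so the "sane encoding" assumption still has to hold for them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from coreutils or gnulib.
   Encode code point CP (0x80..0x10FFFF) as UTF-8 into BUF and
   return the number of bytes written (2, 3, or 4).  */
size_t
utf8_encode (unsigned long cp, unsigned char buf[4])
{
  assert (0x80 <= cp && cp <= 0x10FFFF);
  if (cp < 0x800)
    {
      buf[0] = 0xC0 | (cp >> 6);
      buf[1] = 0x80 | (cp & 0x3F);
      return 2;
    }
  if (cp < 0x10000)
    {
      buf[0] = 0xE0 | (cp >> 12);
      buf[1] = 0x80 | ((cp >> 6) & 0x3F);
      buf[2] = 0x80 | (cp & 0x3F);
      return 3;
    }
  buf[0] = 0xF0 | (cp >> 18);
  buf[1] = 0x80 | ((cp >> 12) & 0x3F);
  buf[2] = 0x80 | ((cp >> 6) & 0x3F);
  buf[3] = 0x80 | (cp & 0x3F);
  return 4;
}

/* Hypothetical helper.  Return the number of multi-byte UTF-8
   sequences that contain a '\n' or '\0' byte.  Every lead byte has
   the form 11xxxxxx and every continuation byte the form 10xxxxxx,
   so each byte is >= 0x80 and can never equal either delimiter.  */
unsigned long
count_delimiter_collisions (void)
{
  unsigned long collisions = 0;
  for (unsigned long cp = 0x80; cp <= 0x10FFFF; cp++)
    {
      unsigned char buf[4];
      size_t n = utf8_encode (cp, buf);
      for (size_t i = 0; i < n; i++)
        if (buf[i] == '\n' || buf[i] == '\0')
          {
            collisions++;
            break;
          }
    }
  return collisions;
}
```

Calling count_delimiter_collisions () from a small main is enough to
run the check. Because no byte of a multi-byte sequence can match a
delimiter, splitting input on a single byte never cuts a character in
half, and only the comparison step needs to be multi-byte aware.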
