Hello,

Attached is an improved version of 'unorm', with Unicode-normalization support
for both the line-by-line and buffer/stream methods.
The code is not yet cleaned up, but it enables comparing the performance of the
two approaches.

Briefly, it seems the buffer/stream method is roughly as fast as line-by-line
when *not* doing Unicode normalization
(only one 'mbrtowc' call is made per input character in both cases).

When Unicode normalization is enabled, line-by-line is faster, likely because:
a) it uses u8_normalize on the entire buffer instead of uninorm_filter
streaming, and
b) an additional wctomb call is required for each output character with
uninorm-filter streaming.

It's likely the buffer/streaming implementation could be improved.

However, there's an issue with uninorm-filter:
The functions (e.g. uninorm_filter_write in gnulib's uninorm.h) use 'ucs4_t',
but mbrtowc/wctomb use 'wchar_t'.
Is there a guarantee that wchar_t is actually a Unicode code point? (I couldn't
find one.)
Currently the code assumes the two are one and the same.
If that assumption is incorrect, an additional conversion will be needed.

The following commands can be used to compare implementations ('-S' uses the 
buffer/stream method instead of line-by-line):

Short lines, with and without normalization:

    yes a | head -n 10M > data1

    env time ./src/unorm    < data1 > /dev/null
    env time ./src/unorm -S < data1 > /dev/null

    env time ./src/unorm -nfkc    < data1 > /dev/null
    env time ./src/unorm -nfkc -S < data1 > /dev/null

Long lines, with and without normalization:

    yes | perl -npe '$_ = "x" x int(rand(10000)) . "\n"' | head -n 50K > data2

    env time ./src/unorm    < data2 > /dev/null
    env time ./src/unorm -S < data2 > /dev/null

    env time ./src/unorm    -nfkc < data2 > /dev/null
    env time ./src/unorm -S -nfkc < data2 > /dev/null

Comments welcome,
 - assaf


Attachment: unorm-2016-08-07.patch.xz