On 23/07/16 19:05, Assaf Gordon wrote:
>
>> On Jul 23, 2016, at 06:51, Pádraig Brady <[email protected]> wrote:
>> I was wondering about the tool being line/record oriented.
>>
>> Disadvantages are:
>> requires arbitrarily large buffers for arbitrarily long lines
>> relatively slow in the presence of short/normal lines
>> sensitive to the current stdio buffering mode
>> requires -z option to support NUL termination
>>
>> Processing a block at a time instead avoids such issues.
>> UTF-8 at least is self-synchronising, so after reading a block
>> you just have to look at the last 3 bytes to know
>> how many to append to the start of the next block.
>
> block-at-a-time would work well for detecting/fixing invalid multibyte
> sequences, especially in UTF-8.
> But I'm not sure about other multibyte encodings (I'll have to investigate).
>
> However, for Unicode normalization, I am not sure there's a stream interface
> to it (gnulib's uninorm takes a whole string to normalize). IIUC,
> normalization requires being able to examine some Unicode characters ahead.
Oh right, I see.
You're saying that splitting per line is a natural way to ensure
you don't split processing in the middle of a decomposed character,
which is significant in normalization processing.
To support that you'd have to do something like:
filter = uninorm_filter_create (...)
while (read (fd, buf, BUFSIZE))
  for each multibyte char in buf:
    u8_mbtouc (&uc, p, remaining);  /* uc becomes U+FFFD on invalid input, i.e. the "fix" */
    uninorm_filter_write (filter, uc);
uninorm_filter_flush (filter)
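
Fleshed out against gnulib's actual uninorm/unistr APIs, a rough
(untested) sketch might look like the following. UNINORM_NFC and the
emit() callback are just placeholders for whatever the tool would
actually want; error handling is elided, and for brevity it assumes
read() never splits a UTF-8 character across blocks:

#include <stdio.h>
#include <unistd.h>
#include <uninorm.h>
#include <unistr.h>

enum { BUFSIZE = 64 * 1024 };

/* Stream callback: re-encode each normalized code point as UTF-8.  */
static int
emit (void *data, ucs4_t uc)
{
  uint8_t out[6];
  int len = u8_uctomb (out, uc, sizeof out);
  if (len < 0 || fwrite (out, 1, len, stdout) != (size_t) len)
    return -1;
  return 0;
}

int
main (void)
{
  struct uninorm_filter *filter
    = uninorm_filter_create (UNINORM_NFC, emit, NULL);
  uint8_t buf[BUFSIZE];
  ssize_t n;

  while ((n = read (STDIN_FILENO, buf, sizeof buf)) > 0)
    {
      ssize_t i = 0;
      while (i < n)
        {
          ucs4_t uc;
          /* Invalid sequences come back as U+FFFD, i.e. the "fix".  */
          int len = u8_mbtouc (&uc, buf + i, n - i);
          uninorm_filter_write (filter, uc);
          i += len;
        }
    }

  uninorm_filter_flush (filter);
  uninorm_filter_free (filter);
  return 0;
}

A real tool would presumably accumulate output in a buffer rather
than doing a per-character fwrite() in the callback.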
I don't know how that would perform compared to u8_normalize().
It might be faster since we're already processing each char?
Or it might be slower if u8_normalize() has some UTF-8-specific optimizations.
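
BTW, for the block-boundary handling mentioned above, determining how
many trailing bytes to carry over to the next block could be something
like this (untested; at most 3 bytes need checking, since a UTF-8
sequence is at most 4 bytes long):

/* Return the number of trailing bytes of buf[0..n) that form an
   incomplete UTF-8 sequence, to be prepended to the next block.  */
static size_t
incomplete_suffix (const uint8_t *buf, size_t n)
{
  for (size_t k = 1; k <= 3 && k <= n; k++)
    {
      uint8_t b = buf[n - k];
      if ((b & 0xC0) != 0x80)  /* found a lead (or ASCII) byte */
        {
          /* Expected sequence length implied by the lead byte.  */
          size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
          return len > k ? k : 0;  /* incomplete iff it needs more bytes */
        }
    }
  return 0;  /* complete sequence or invalid input; let the decoder "fix" it */
}

Each read() would then copy those bytes to the front of the next
buffer before decoding.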
cheers,
Pádraig