On 23/07/16 19:05, Assaf Gordon wrote:
>
>> On Jul 23, 2016, at 06:51, Pádraig Brady <[email protected]> wrote:
>> I was wondering about the tool being line/record oriented.
>>
>> Disadvantages are:
>> requires arbitrarily large buffers for arbitrarily long lines
>> relatively slow in the presence of short/normal lines
>> sensitive to the current stdio buffering mode
>> requires -z option to support NUL termination
>>
>> Processing a block at a time instead avoids such issues.
>> UTF-8 at least is self-synchronising, so after reading a block
>> you just have to look at the last 3 bytes to know
>> how many to append to the start of the next block.
>
> block-at-a-time would work well for detecting/fixing invalid multibyte
> sequences, especially in UTF-8.
> But I'm not sure about other multibyte encodings (I'll have to investigate).
>
> However, for Unicode normalization, I am not sure there's a stream interface
> to it (gnulib's uninorm takes a whole string to normalize). IIUC,
> normalization requires being able to examine some Unicode characters ahead.
Oh right, I see.
You're saying that splitting per line is a natural way to ensure
you don't split processing in the middle of a decomposed character,
which is significant in normalization processing.
To support that you'd have to do something like:
filter = uninorm_filter_create (...)
while (read (fd, buf, BUFSIZE))
  for each multibyte char in buf:
    u8_mbtouc (&uc, p, remaining);  /* uc becomes U+FFFD on invalid input, i.e. the "fix" */
    uninorm_filter_write (filter, uc);
uninorm_filter_flush (filter)
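
Fleshed out against gnulib's actual uninorm/unistr APIs, a rough
(untested) sketch might look like the following. UNINORM_NFC and the
emit() callback are just placeholders for whatever the tool would
actually want; error handling is elided, and for brevity it assumes
read() never splits a UTF-8 character across blocks:

#include <stdio.h>
#include <unistd.h>
#include <uninorm.h>
#include <unistr.h>

enum { BUFSIZE = 64 * 1024 };

/* Stream callback: re-encode each normalized code point as UTF-8.  */
static int
emit (void *data, ucs4_t uc)
{
  uint8_t out[6];
  int len = u8_uctomb (out, uc, sizeof out);
  if (len < 0 || fwrite (out, 1, len, stdout) != (size_t) len)
    return -1;
  return 0;
}

int
main (void)
{
  struct uninorm_filter *filter
    = uninorm_filter_create (UNINORM_NFC, emit, NULL);
  uint8_t buf[BUFSIZE];
  ssize_t n;

  while ((n = read (STDIN_FILENO, buf, sizeof buf)) > 0)
    {
      ssize_t i = 0;
      while (i < n)
        {
          ucs4_t uc;
          /* Invalid sequences come back as U+FFFD, i.e. the "fix".  */
          int len = u8_mbtouc (&uc, buf + i, n - i);
          uninorm_filter_write (filter, uc);
          i += len;
        }
    }

  uninorm_filter_flush (filter);
  uninorm_filter_free (filter);
  return 0;
}

A real tool would presumably accumulate output in a buffer rather
than doing a per-character fwrite() in the callback.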
I don't know how that would perform compared to u8_normalize().
It might be faster since we're already processing each char?
Or it might be slower if u8_normalize() has some UTF-8-specific optimizations.
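
BTW, for the block-boundary handling mentioned above, determining how
many trailing bytes to carry over to the next block could be something
like this (untested; at most 3 bytes need checking, since a UTF-8
sequence is at most 4 bytes long):

/* Return the number of trailing bytes of buf[0..n) that form an
   incomplete UTF-8 sequence, to be prepended to the next block.  */
static size_t
incomplete_suffix (const uint8_t *buf, size_t n)
{
  for (size_t k = 1; k <= 3 && k <= n; k++)
    {
      uint8_t b = buf[n - k];
      if ((b & 0xC0) != 0x80)  /* found a lead (or ASCII) byte */
        {
          /* Expected sequence length implied by the lead byte.  */
          size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
          return len > k ? k : 0;  /* incomplete iff it needs more bytes */
        }
    }
  return 0;  /* complete sequence or invalid input; let the decoder "fix" it */
}

Each read() would then copy those bytes to the front of the next
buffer before decoding.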
cheers,
Pádraig