> On Jul 23, 2016, at 06:51, Pádraig Brady <[email protected]> wrote:
> I was wondering about the tool being line/record oriented.
>
> Disadvantages are:
>   requires arbitrary large buffers for arbitrary long lines
>   relatively slow in the presence of short/normal lines
>   sensitive to the current stdio buffering mode
>   requires -z option to support NUL termination
>
> Processing instead a block at a time avoids such issues.
> UTF-8 at least is self synchronising, so after reading a block
> you just have to look at the last 3 bytes to know
> how many to append to the start of the next block.
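That carry-over computation might look something like this (a quick Python sketch just for illustration; `utf8_carry` is my own name for it, and handling of outright-invalid bytes is left to the later validation pass):

```python
def utf8_carry(block: bytes) -> int:
    """Return how many trailing bytes of `block` begin an incomplete
    UTF-8 sequence and so must be prepended to the next block."""
    i = len(block)
    # Step back over at most 3 continuation bytes (0b10xxxxxx): a UTF-8
    # sequence is at most 4 bytes, so the lead byte of any incomplete
    # sequence must sit within the last 3 bytes.
    while i > 0 and len(block) - i < 3 and block[i - 1] & 0xC0 == 0x80:
        i -= 1
    if i == 0:
        return 0  # only continuation bytes seen; let validation flag them
    lead = block[i - 1]
    if lead < 0x80:
        seq = 1                       # ASCII
    elif lead & 0xE0 == 0xC0:
        seq = 2
    elif lead & 0xF0 == 0xE0:
        seq = 3
    elif lead & 0xF8 == 0xF0:
        seq = 4
    else:
        return 0                      # stray continuation byte: invalid, don't carry
    have = len(block) - (i - 1)
    return have if have < seq else 0  # carry only if the sequence is cut short

# A block ending mid-sequence carries the partial bytes forward:
assert utf8_carry(b"ab\xc3") == 1     # "é" split after its first byte
assert utf8_carry(b"x\xe2\x82") == 2  # "€" split after two of its three bytes
assert utf8_carry(b"\xc3\xa9") == 0   # complete sequence: nothing to carry
```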
Block-at-a-time would work well for detecting/fixing invalid multibyte sequences, especially in UTF-8. But I'm not sure about other multibyte encodings (I'll have to investigate).

However, for Unicode normalization, I am not sure there's a stream interface to it (gnulib's uninorm takes a whole string to normalize). IIUC, normalization requires being able to examine some Unicode characters ahead.

-assaf
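P.S. A quick illustration of the lookahead problem, using Python's unicodedata as a stand-in here (not gnulib, just to show the behavior):

```python
import unicodedata

# NFC of "e" on its own is just "e" ...
assert unicodedata.normalize("NFC", "e") == "e"

# ... but a streaming normalizer could not have emitted that "e" yet:
# if the next character turns out to be COMBINING ACUTE ACCENT, the
# two compose into a single precomposed "é" (U+00E9).
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```

So the normalized form of a character can depend on characters that arrive after it, which is exactly why a block boundary can't simply fall anywhere.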
