> On Jul 23, 2016, at 16:30, Pádraig Brady <[email protected]> wrote:
>
> On 23/07/16 19:05, Assaf Gordon wrote:
>>
>>> On Jul 23, 2016, at 06:51, Pádraig Brady <[email protected]> wrote:
>>> I was wondering about the tool being line/record oriented.
>>>
>>> Disadvantages are:
>>> requires arbitrarily large buffers for arbitrarily long lines
>>> relatively slow in the presence of short/normal lines
>>> sensitive to the current stdio buffering mode
>>> requires -z option to support NUL termination
>>>
>>> Processing instead a block at a time avoids such issues.
>>> UTF-8 at least is self-synchronizing, so after reading a block
>>> you just have to look at the last 3 bytes to know
>>> how many to append to the start of the next block.
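The self-synchronizing property described above can be sketched in C: to find how many trailing octets of a block belong to an incomplete sequence, scan back at most 3 bytes for a lead byte. (utf8_carryover is a hypothetical illustration, not code from the attached patch.)

```c
#include <stddef.h>

/* Count how many trailing bytes of BUF (of length LEN) form an
   incomplete UTF-8 sequence that should be carried over to the start
   of the next block.  Returns 0..3.  */
static size_t
utf8_carryover (const unsigned char *buf, size_t len)
{
  size_t back = len < 3 ? len : 3;
  for (size_t i = 1; i <= back; i++)
    {
      unsigned char b = buf[len - i];
      if (b < 0x80)             /* ASCII byte: sequence complete */
        return 0;
      if (b >= 0xC0)            /* lead byte of a multibyte sequence */
        {
          size_t need = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
          return i < need ? i : 0;   /* carry only if bytes are missing */
        }
      /* else continuation byte (0x80..0xBF): keep scanning back */
    }
  return 0;   /* no lead byte within 3 octets: nothing valid to carry */
}
```

For example, a block ending in the first two bytes of U+20AC (0xE2 0x82) yields a carry of 2, while a block ending in the complete three-byte sequence yields 0.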
Attached is a partial, crude implementation of stream-based processing. It currently handles only fixing invalid sequences, with no Unicode normalization yet.

It contains both implementations, to ease comparison (use "-S/--stream" for the new implementation, or omit it for the previous line-based implementation).

The main functions are (to facilitate discussion):

mbbuf_read     - reads more data from the input, moving 'incomplete/left-over'
                 octets from the previous read to the beginning of the buffer
                 (somewhat like grep's fillbuf() but not as sophisticated).
STRM_unorm_buf - iterates over the octets in the current buffer.
STRM_unorm_fd  - repeatedly reads the file and calls STRM_unorm_buf.

The tests use both methods and the results are identical (except for Unicode normalization, which is currently skipped for --stream).

A few issues are emerging:

1. If only validation is required (i.e. no Unicode normalization), it is wasteful to convert the input to wchar_t and back again; it would be better to write the output as-is. If Unicode normalization is requested, then going through wchar_t and uninorm's filter is needed. Perhaps two separate dedicated functions would be more efficient.

2. Regarding skipping stdio buffering: I assume you referred to input. The code now uses file descriptors and 'safe_read', thus bypassing stdio buffering on input. But it still uses stdio for output (this seems in line with tac, split, tr, etc.). If we want to bypass stdio on output as well, some extra code for internal buffering might be needed.

3. I believe that for this tool to be really useful, it should report the line number and column of offending/invalid octets. In that case the code needs to count lines/columns, and will need to be aware of which line terminator is used - meaning "-z" is still needed. The attached code does count lines/columns (see struct mbbuffer), and is thus a bit cumbersome. Currently it seems this optimization leads to somewhat more complicated code.
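The refill step that mbbuf_read performs can be sketched roughly as follows. This is a simplified illustration under assumed names (mbbuf_sketch, mbbuf_refill), using plain read() rather than gnulib's safe_read, and omitting the line/column bookkeeping from the real struct mbbuffer:

```c
#include <string.h>
#include <unistd.h>

/* Simplified sketch of the buffer-refill step: move the left-over
   (incomplete) octets from the previous read to the start of the
   buffer, then read new data in after them.  */
struct mbbuf_sketch
{
  unsigned char buf[16384];
  size_t len;     /* valid bytes currently in buf */
  size_t carry;   /* trailing bytes kept from the previous block */
};

/* Returns bytes read, 0 at EOF, or -1 on error (like read).
   After a successful call, mb->buf holds mb->len valid bytes:
   the carried-over tail followed by freshly read data.  */
static ssize_t
mbbuf_refill (struct mbbuf_sketch *mb, int fd)
{
  /* Move the incomplete tail octets to the front of the buffer.  */
  memmove (mb->buf, mb->buf + mb->len - mb->carry, mb->carry);

  ssize_t n = read (fd, mb->buf + mb->carry, sizeof mb->buf - mb->carry);
  if (n < 0)
    return -1;
  mb->len = mb->carry + (size_t) n;
  return n;
}
```

The caller would set mb->carry after scanning each block (e.g. from the count of trailing incomplete octets), so the next refill prepends them before the new data.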
Once I have the Unicode normalization implemented we can compare speeds and see which method is preferred.

Comments very welcome,
 - assaf
0001-unorm-a-new-program-to-fix-and-normalize-multibyte-f.patch.xz
Description: Binary data
