> On Jul 23, 2016, at 16:30, Pádraig Brady <[email protected]> wrote:
>
> On 23/07/16 19:05, Assaf Gordon wrote:
>>
>>> On Jul 23, 2016, at 06:51, Pádraig Brady <[email protected]> wrote:
>>> I was wondering about the tool being line/record oriented.
>>>
>>> Disadvantages are:
>>> requires arbitrarily large buffers for arbitrarily long lines
>>> relatively slow in the presence of short/normal lines
>>> sensitive to the current stdio buffering mode
>>> requires -z option to support NUL termination
>>>
>>> Processing instead a block at a time avoids such issues.
>>> UTF-8 at least is self-synchronizing, so after reading a block
>>> you just have to look at the last 3 bytes to know
>>> how many to append to the start of the next block.
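The self-synchronizing property described above can be sketched in C: to find how many trailing octets of a block belong to an incomplete sequence, scan back at most 3 bytes for a lead byte. (utf8_carryover is a hypothetical illustration, not code from the attached patch.)

```c
#include <stddef.h>

/* Count how many trailing bytes of BUF (of length LEN) form an
   incomplete UTF-8 sequence that should be carried over to the start
   of the next block.  Returns 0..3.  */
static size_t
utf8_carryover (const unsigned char *buf, size_t len)
{
  size_t back = len < 3 ? len : 3;
  for (size_t i = 1; i <= back; i++)
    {
      unsigned char b = buf[len - i];
      if (b < 0x80)             /* ASCII byte: sequence complete */
        return 0;
      if (b >= 0xC0)            /* lead byte of a multibyte sequence */
        {
          size_t need = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
          return i < need ? i : 0;   /* carry only if bytes are missing */
        }
      /* else continuation byte (0x80..0xBF): keep scanning back */
    }
  return 0;   /* no lead byte within 3 octets: nothing valid to carry */
}
```

For example, a block ending in the first two bytes of U+20AC (0xE2 0x82) yields a carry of 2, while a block ending in the complete three-byte sequence yields 0.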
Attached is a partial, crude implementation of stream-based processing. It currently handles only fixing invalid sequences, with no Unicode normalization yet.

It contains both implementations, to ease comparison (use "-S/--stream" for the new implementation, or omit it for the previous line-based implementation).

The main functions are (to facilitate discussion):

mbbuf_read     - reads more data from the input, moving 'incomplete/left-over'
                 octets from the previous read to the beginning of the buffer
                 (somewhat like grep's fillbuf() but not as sophisticated).
STRM_unorm_buf - iterates over the octets in the current buffer.
STRM_unorm_fd  - repeatedly reads the file and calls STRM_unorm_buf.

The tests use both methods and the results are identical (except for Unicode normalization, which is currently skipped for --stream).

A few issues are emerging:

1. If only validation is required (i.e. no Unicode normalization), it is wasteful to convert the input to wchar_t and back again; it would be better to write the output as-is. If Unicode normalization is requested, then going through wchar_t and uninorm's filter is needed. Perhaps two separate dedicated functions would be more efficient.

2. Regarding skipping stdio buffering: I assume you referred to input. The code now uses file descriptors and 'safe_read', thus bypassing stdio buffering on input. But it still uses stdio for output (this seems in line with tac, split, tr, etc.). If we want to bypass stdio on output as well, some extra code for internal buffering might be needed.

3. I believe that for this tool to be really useful, it should report the line number and column of offending/invalid octets. In that case the code needs to count lines/columns, and will need to be aware of which line terminator is used - meaning "-z" is still needed. The attached code does count lines/columns (see struct mbbuffer), and is thus a bit cumbersome. Currently it seems this optimization leads to somewhat more complicated code.
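The refill step that mbbuf_read performs can be sketched roughly as follows. This is a simplified illustration under assumed names (mbbuf_sketch, mbbuf_refill), using plain read() rather than gnulib's safe_read, and omitting the line/column bookkeeping from the real struct mbbuffer:

```c
#include <string.h>
#include <unistd.h>

/* Simplified sketch of the buffer-refill step: move the left-over
   (incomplete) octets from the previous read to the start of the
   buffer, then read new data in after them.  */
struct mbbuf_sketch
{
  unsigned char buf[16384];
  size_t len;     /* valid bytes currently in buf */
  size_t carry;   /* trailing bytes kept from the previous block */
};

/* Returns bytes read, 0 at EOF, or -1 on error (like read).
   After a successful call, mb->buf holds mb->len valid bytes:
   the carried-over tail followed by freshly read data.  */
static ssize_t
mbbuf_refill (struct mbbuf_sketch *mb, int fd)
{
  /* Move the incomplete tail octets to the front of the buffer.  */
  memmove (mb->buf, mb->buf + mb->len - mb->carry, mb->carry);

  ssize_t n = read (fd, mb->buf + mb->carry, sizeof mb->buf - mb->carry);
  if (n < 0)
    return -1;
  mb->len = mb->carry + (size_t) n;
  return n;
}
```

The caller would set mb->carry after scanning each block (e.g. from the count of trailing incomplete octets), so the next refill prepends them before the new data.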
Once I have the Unicode normalization implemented we can compare speeds and see which method is preferred.

Comments very welcome,
 - assaf
0001-unorm-a-new-program-to-fix-and-normalize-multibyte-f.patch.xz
Description: Binary data
