> On Jul 23, 2016, at 06:51, Pádraig Brady <[email protected]> wrote:
> I was wondering about the tool being line/record oriented.
>
> Disadvantages are:
>   requires arbitrary large buffers for arbitrary long lines
>   relatively slow in the presence of short/normal lines
>   sensitive to the current stdio buffering mode
>   requires -z option to support NUL termination
>
> Processing instead a block at a time avoids such issues.
> UTF-8 at least is self synchronising, so after reading a block
> you just have to look at the last 3 bytes to know
> how many to append to the start of the next block.
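That carry-over computation might look something like this (a quick Python sketch just for illustration; `utf8_carry` is my own name for it, and handling of outright-invalid bytes is left to the later validation pass):

```python
def utf8_carry(block: bytes) -> int:
    """Return how many trailing bytes of `block` begin an incomplete
    UTF-8 sequence and so must be prepended to the next block."""
    i = len(block)
    # Step back over at most 3 continuation bytes (0b10xxxxxx): a UTF-8
    # sequence is at most 4 bytes, so the lead byte of any incomplete
    # sequence must sit within the last 3 bytes.
    while i > 0 and len(block) - i < 3 and block[i - 1] & 0xC0 == 0x80:
        i -= 1
    if i == 0:
        return 0  # only continuation bytes seen; let validation flag them
    lead = block[i - 1]
    if lead < 0x80:
        seq = 1                       # ASCII
    elif lead & 0xE0 == 0xC0:
        seq = 2
    elif lead & 0xF0 == 0xE0:
        seq = 3
    elif lead & 0xF8 == 0xF0:
        seq = 4
    else:
        return 0                      # stray continuation byte: invalid, don't carry
    have = len(block) - (i - 1)
    return have if have < seq else 0  # carry only if the sequence is cut short

# A block ending mid-sequence carries the partial bytes forward:
assert utf8_carry(b"ab\xc3") == 1     # "é" split after its first byte
assert utf8_carry(b"x\xe2\x82") == 2  # "€" split after two of its three bytes
assert utf8_carry(b"\xc3\xa9") == 0   # complete sequence: nothing to carry
```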
Block-at-a-time would work well for detecting/fixing invalid multibyte sequences, especially in UTF-8. But I'm not sure about other multibyte encodings (I'll have to investigate).

However, for Unicode normalization, I am not sure there's a stream interface to it (gnulib's uninorm takes a whole string to normalize). IIUC, normalization requires being able to examine some Unicode characters ahead.

-assaf
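P.S. A quick illustration of the lookahead problem, using Python's unicodedata as a stand-in here (not gnulib, just to show the behavior):

```python
import unicodedata

# NFC of "e" on its own is just "e" ...
assert unicodedata.normalize("NFC", "e") == "e"

# ... but a streaming normalizer could not have emitted that "e" yet:
# if the next character turns out to be COMBINING ACUTE ACCENT, the
# two compose into a single precomposed "é" (U+00E9).
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```

So the normalized form of a character can depend on characters that arrive after it, which is exactly why a block boundary can't simply fall anywhere.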
