On 22/07/16 04:23, Assaf Gordon wrote:
> Hello,
>
>> On Jul 21, 2016, at 06:08, Pádraig Brady <[email protected]> wrote:
>> [...]
>> It seems like --normalization={NFKD,NFKD,NFC,NFD} functionality would
>> also be quite cohesive in such a util.
>
> Attached an improved version with unicode normalization support.
>
> Before continuing with other stuff (e.g. more tests, documentation, news,
> etc.),
> it's worth discussing if this is the path to take (or if we want to add this
> to each individual utility).
> Also, do we keep these options or modify them?
> e.g. 'uconv' uses different terminology for handling invalid sequences: stop,
> skip, substitute, escape (corresponding to abort, discard, replace, recode
> below).
>
> To keep the implementation simple, unicode normalization requires UTF-8
> locales - is this a valid requirement?
>
> And of course, what about the name?
>
> Comments welcomed,
> - assaf
>
>
> Example (from 'Unicode Explained' book):
> ===========
> $ printf '\uFB01anc\u00E9\n'
> fiancé
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfd | od -An -tx1c
>   ef  ac  81  61  6e  63  65  cc  81  0a
>    ?   ? 201   a   n   c   e   ? 201  \n
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfc | od -An -tx1c
>   ef  ac  81  61  6e  63  c3  a9  0a
>    ?   ? 201   a   n   c   ?   ?  \n
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkd | od -An -tx1c
>   66  69  61  6e  63  65  cc  81  0a
>    f   i   a   n   c   e   ? 201  \n
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkc | od -An -tx1c
>   66  69  61  6e  63  c3  a9  0a
>    f   i   a   n   c   ?   ?  \n
>
> $ ./src/mbfix --help
> Usage: ./src/mbfix [OPTION]... [FILE]...
> Fix and adjust multibyte character in files
>
> Mandatory arguments to long options are mandatory for short options too.
>   -A, --abort            same as --policy=abort
>   -C, --recode           same as --policy=recode
>   -c, --check            validate input, no output
>   -D, --discard          same as --policy=discard
>   -n, --normalization=NORM
>                          apply unicode normalization NORM:, one of:
>                          nfd, nfc, nfkd, nfkc. Normalization requires
>                          UTF-8 locales.
>   -p, --policy=POLICY    invalid-input policy: discard, abort
>                          replace (default), recode
>   -R, --replace          same as --policy=replace
>       --replace-char=N
>                          with 'replace' policy, use unicode character N
>                          (default: 0xFFFD 'REPLACEMENT CHARACTER')
>       --recode-format=FMT
>                          with 'recode' policy, recode invalid octets
>                          using FMT printf-format (default: '<0x%02x>')
>   -v, --verbose          report location of invalid input
>   -z, --zero-terminated  line delimiter is NUL, not newline
>       --help             display this help and exit
>       --version          output version information and exit
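
As a cross-check on the quoted example, the four normalization forms can be reproduced independently of the patch. The sketch below uses Python's standard `unicodedata` module rather than mbfix itself; the helper name `utf8_hex` is made up for illustration. It should give the same hex bytes as the od output above, minus the trailing newline.

```python
import unicodedata

# U+FB01 (fi ligature) + "anc" + U+00E9 (e with acute), as in the quoted example
text = "\ufb01anc\u00e9"


def utf8_hex(s: str) -> str:
    """Hypothetical helper: the UTF-8 encoding of s as space-separated hex octets."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))


# Map each normalization form to the hex dump of the normalized string.
forms = {
    form: utf8_hex(unicodedata.normalize(form, text))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, hexbytes in forms.items():
    print(f"{form:<5} {hexbytes}")
```

Only the compatibility forms (NFKD, NFKC) split the ligature into "f" and "i"; the canonical forms (NFD, NFC) leave it alone and only affect the accented letter.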

I was wondering about the tool being line/record oriented.

Disadvantages are:
  requires arbitrarily large buffers for arbitrarily long lines
  relatively slow in the presence of short/normal lines
  sensitive to the current stdio buffering mode
  requires -z option to support NUL termination

Processing a block at a time instead avoids such issues.
UTF-8 at least is self-synchronising, so after reading a block
you only have to look at its last 3 bytes to know
how many to carry over to the start of the next block.

Pádraig.
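
The block-boundary check described above can be sketched roughly as follows. This is only an illustration in Python, not code from the patch; the names `incomplete_tail_len` and `split_blocks` are hypothetical. It assumes UTF-8 input and leaves any validation of the bytes to a later stage (the invalid-input policy).

```python
def incomplete_tail_len(block: bytes) -> int:
    """Return how many trailing bytes (0..3) start a UTF-8 sequence that is
    not yet complete within this block."""
    # A UTF-8 sequence is at most 4 bytes long, so an unfinished one at the
    # end of a block is at most 3 bytes: look back no further than that.
    for back in range(1, min(3, len(block)) + 1):
        b = block[-back]
        if b & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if b >= 0xF0:
            need = 4
        elif b >= 0xE0:
            need = 3
        elif b >= 0xC0:
            need = 2
        else:
            need = 1  # ASCII
        # The lead byte announces `need` bytes, but only `back` are present.
        return back if need > back else 0
    return 0  # no lead byte in the last 3 bytes: nothing to carry over


def split_blocks(chunks):
    """Yield byte blocks that end on a character boundary, carrying an
    unfinished trailing sequence over to the start of the next block."""
    carry = b""
    for chunk in chunks:
        data = carry + chunk
        n = incomplete_tail_len(data)
        whole, carry = data[:len(data) - n], data[len(data) - n:]
        if whole:
            yield whole
    if carry:
        # Truncated sequence at end of input: pass it on unchanged so the
        # invalid-input policy can deal with it.
        yield carry


if __name__ == "__main__":
    data = "\ufb01anc\u00e9 \u20ac\n".encode("utf-8")
    # Deliberately awkward 2-byte reads that cut characters in half.
    chunks = [data[i:i + 2] for i in range(0, len(data), 2)]
    for blk in split_blocks(chunks):
        print(blk.decode("utf-8"), end="")
```

No line buffer is needed here: memory use is bounded by the read size plus at most 3 carried bytes, regardless of line length.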
