Re: multibyte processing - handling invalid sequences (long)

Pádraig Brady Sat, 23 Jul 2016 03:52:06 -0700

On 22/07/16 04:23, Assaf Gordon wrote:
> Hello,
> 
>> On Jul 21, 2016, at 06:08, Pádraig Brady <[email protected]> wrote:
>> [...]
>> It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
>> also be quite cohesive in such a util.
> 
> Attached an improved version with unicode normalization support.
> 
> Before continuing with other stuff (e.g. more tests, documentation, news,
> etc.), it's worth discussing if this is the path to take (or if we want to
> add this to each individual utility).
> Also, do we keep these options or modify them?
> e.g. 'uconv' uses different terminology for handling invalid sequences:
> stop, skip, substitute, escape (corresponding to abort, discard, replace,
> recode below).
> 
> To keep the implementation simple, unicode normalization requires UTF-8 
> locales - is this a valid requirement?
> 
> And of course, what about the name?
> 
> Comments welcomed,
>  - assaf
> 
> 
> 
> 
> Example (from 'Unicode Explained' book):
> ===========
> $ printf '\uFB01anc\u00E9\n'
> ﬁancé
> 
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfd | od -An -tx1c
>   ef  ac  81  61  6e  63  65  cc  81  0a
>    ?   ? 201   a   n   c   e   ? 201  \n
> 
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfc | od -An -tx1c
>   ef  ac  81  61  6e  63  c3  a9  0a
>    ?   ? 201   a   n   c   ?   ?  \n
> 
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkd | od -An -tx1c
>   66  69  61  6e  63  65  cc  81  0a
>    f   i   a   n   c   e   ? 201  \n
> 
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkc | od -An -tx1c
>   66  69  61  6e  63  c3  a9  0a
>    f   i   a   n   c   ?   ?  \n
> 
> $ ./src/mbfix --help
> Usage: ./src/mbfix [OPTION]... [FILE]...
> Fix and adjust multibyte character in files
> 
> Mandatory arguments to long options are mandatory for short options too.
>   -A, --abort          same as --policy=abort
>   -C, --recode         same as --policy=recode
>   -c, --check          validate input, no output
>   -D, --discard        same as --policy=discard
>   -n, --normalization=NORM
>                        apply unicode normalization NORM:, one of:
>                        nfd, nfc, nfkd, nfkc. Normalization requires
>                        UTF-8 locales.
>   -p, --policy=POLICY  invalid-input policy: discard, abort
>                        replace (default), recode
>   -R, --replace        same as --policy=replace
>       --replace-char=N
>                        with 'replace' policy, use unicode character N
>                        (default: 0xFFFD 'REPLACEMENT CHARACTER')
>       --recode-format=FMT
>                        with 'recode' policy, recode invalid octets
>                        using FMT printf-format (default: '<0x%02x>')
>   -v, --verbose        report location of invalid input
>   -z, --zero-terminated    line delimiter is NUL, not newline
>       --help     display this help and exit
>       --version  output version information and exit

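As a cross-check of the four normalization forms in the quoted example,
here is a small sketch using Python's standard unicodedata module rather
than mbfix itself. The helper name normalized_hex is made up for this
illustration. It prints the UTF-8 bytes in hex so they can be compared
against the od output above.

```python
import unicodedata

# Same input as the quoted example: U+FB01 (fi ligature), "anc", U+00E9, newline.
text = "\uFB01anc\u00E9\n"


def normalized_hex(form: str, s: str) -> str:
    """Normalize s with the given form and return its UTF-8 bytes as hex."""
    return unicodedata.normalize(form, s).encode("utf-8").hex(" ")


for form in ("NFD", "NFC", "NFKD", "NFKC"):
    print(f"{form:4} {normalized_hex(form, text)}")
```

Only the compatibility forms (NFKD, NFKC) split the ligature into "f" and
"i"; the canonical forms (NFD, NFC) leave it alone and only affect the
accented letter.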

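The four invalid-input policies in the help text can be approximated with
Python's codec error handlers. This is only a rough sketch under stated
assumptions: the mapping from policy names to handlers, the handler name
"recode_sketch" and the function fix are invented here, and Python may
group the bytes of a malformed sequence differently from mbfix.

```python
import codecs


def recode_handler(err):
    """Render each undecodable byte as '<0xNN>' (the default recode format)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("<0x%02x>" % b for b in bad), err.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("recode_sketch", recode_handler)

# Assumed correspondence between the policies and Python's handlers.
POLICIES = {
    "abort": "strict",          # raise on the first invalid sequence
    "discard": "ignore",        # drop invalid bytes
    "replace": "replace",       # substitute U+FFFD
    "recode": "recode_sketch",  # print the offending bytes
}


def fix(data: bytes, policy: str = "replace") -> str:
    """Decode data as UTF-8, handling invalid input per the given policy."""
    return data.decode("utf-8", errors=POLICIES[policy])


sample = b"a\xffb"
for name in ("discard", "replace", "recode"):
    print(name, repr(fix(sample, name)))
```

With the "abort" policy the same call raises UnicodeDecodeError instead of
returning a string.
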
I was wondering about the tool being line/record oriented.

Disadvantages are:
  requires arbitrarily large buffers for arbitrarily long lines
  relatively slow in the presence of short/normal lines
  sensitive to the current stdio buffering mode
  requires -z option to support NUL termination

Processing a block at a time instead avoids such issues.
UTF-8 at least is self-synchronising, so after reading a block
you only have to look at its last 3 bytes to know
how many to carry over to the start of the next block.
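
A minimal sketch of that idea, in Python for brevity rather than the C the
tool would use. The names incomplete_tail_len and iter_complete_blocks are
hypothetical, and the sketch only decides where to cut; validating the
bytes inside each block is left to the invalid-input policy.

```python
import io


def incomplete_tail_len(block: bytes) -> int:
    """Return how many trailing bytes (0-3) begin a UTF-8 sequence that is
    cut off by the end of the block."""
    for back in range(1, min(3, len(block)) + 1):
        byte = block[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep looking back for the lead byte
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1  # ASCII
        return back if need > back else 0
    return 0  # only continuation bytes seen: nothing to carry over


def iter_complete_blocks(stream, block_size=65536):
    """Yield blocks from stream that never end in the middle of a sequence."""
    carry = b""
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            if carry:
                yield carry  # truncated sequence at EOF, i.e. invalid input
            return
        data = carry + chunk
        n = incomplete_tail_len(data)
        cut = len(data) - n
        carry = data[cut:]
        if cut:
            yield data[:cut]


data = "a\u20acb\U0001F600\u00e9".encode("utf-8")
for block in iter_complete_blocks(io.BytesIO(data), block_size=4):
    print(block.decode("utf-8"))
```

At most 3 bytes are ever carried over, so the buffer size stays fixed
regardless of line length.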

Pádraig.
