Re: multibyte processing - handling invalid sequences (long)

Pádraig Brady Fri, 22 Jul 2016 04:49:51 -0700

On 22/07/16 04:23, Assaf Gordon wrote:
> Hello,
> 
>> On Jul 21, 2016, at 06:08, Pádraig Brady <[email protected]> wrote:
>> [...]
>> It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
>> also be quite cohesive in such a util.
> 
> Attached an improved version with unicode normalization support.


Wow, very nice.

> Before continuing with other stuff (e.g. more tests, documentation, news,
> etc.), it's worth discussing if this is the path to take (or if we want to
> add this to each individual utility).

I'm not sure, but as I said, it would be nice if we could get away with
"replace" mode in other utils.
A separate util follows the idea of validating/transforming input as early
as possible, so as to simplify the rest of the system.  It also follows the
idea that if something can be done separately, it should be.

> Also, do we keep these options or modify them?
> e.g. 'uconv' uses different terminology for handling invalid sequences: stop, 
> skip, substitute, escape (corresponding to abort, discard, replace, recode 
> below).

Doesn't really matter.
I find your naming slightly more descriptive.
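
To make the four policies concrete, here is a minimal Python sketch rather
than the attached implementation. It assumes that abort, discard, replace and
recode correspond roughly to Python's codec error handlers "strict", "ignore",
"replace" and "backslashreplace". That pairing, and the names
POLICY_TO_HANDLER and decode_with_policy, are hypothetical and chosen only
for illustration.

```python
# Hypothetical mapping from the policy names discussed in this thread to
# Python's built-in codec error handlers. The pairing is an assumption made
# for illustration; the proposed utility may behave differently.
POLICY_TO_HANDLER = {
    "abort": "strict",             # raise on the first invalid byte
    "discard": "ignore",           # drop invalid bytes silently
    "replace": "replace",          # substitute U+FFFD for invalid bytes
    "recode": "backslashreplace",  # emit an escape for each invalid byte
}


def decode_with_policy(data: bytes, policy: str) -> str:
    """Decode UTF-8 input, handling invalid sequences per the named policy."""
    return data.decode("utf-8", errors=POLICY_TO_HANDLER[policy])


if __name__ == "__main__":
    sample = b"a\xffb"  # 0xFF can never appear in well-formed UTF-8
    for policy in ("discard", "replace", "recode"):
        print(policy, repr(decode_with_policy(sample, policy)))
    try:
        decode_with_policy(sample, "abort")
    except UnicodeDecodeError as exc:
        print("abort", exc.reason)
```

Running the input through such a filter first means downstream tools only
ever see valid sequences, which is the "validate as early as possible" point
made above.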

> To keep the implementation simple, unicode normalization requires UTF-8 
> locales - is this a valid requirement?

Given how prevalent UTF-8 is, I think this is fine.
In other tools, if there is an option, we should also tune for UTF-8 input.
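
For reference, the four normalization forms can be sketched in Python with
the standard unicodedata module. This only illustrates what normalizing
UTF-8 input means, under the assumption that the input is already valid
UTF-8; it is not the attached code, and normalize_utf8 is a hypothetical
helper name.

```python
import unicodedata

# The four normalization forms defined by the Unicode standard.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_utf8(data: bytes, form: str) -> bytes:
    """Normalize UTF-8 encoded bytes to the given form.

    Assumes `data` is valid UTF-8; invalid input raises UnicodeDecodeError.
    """
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    text = data.decode("utf-8")
    return unicodedata.normalize(form, text).encode("utf-8")


if __name__ == "__main__":
    precomposed = "\u00e9".encode("utf-8")  # e with acute, one code point
    ligature = "\ufb01".encode("utf-8")     # the "fi" ligature
    for form in FORMS:
        print(form, normalize_utf8(precomposed, form),
              normalize_utf8(ligature, form))
```

The canonical forms (NFC, NFD) only compose or decompose characters, while
the compatibility forms (NFKC, NFKD) additionally fold characters such as
ligatures into their plain equivalents.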

> And of course, what about the name?

I've a slight preference for unorm

thanks!
Pádraig
