Re: multibyte processing - handling invalid sequences (long)

Eric Blake Wed, 20 Jul 2016 06:05:26 -0700

On 07/20/2016 06:21 AM, Pádraig Brady wrote:

> It's worth considering having a separate (already existing?) util
> to fix data before processing. That could have options to:
>   drop invalid chars, replace with replacement char,
>   apply various http://unicode.org/reports/tr15/#Norm_Forms,
>   convert enclosed forms like ㊷ to 42 etc.
> I.E. we should avoid complicating each util where possible,
> and at least avoid having options on each util that could be
> hoisted to a more general util like above.
> 
> Silently dropping invalid characters probably isn't a great idea,
> and warnings to stderr is a bit messy and could be seen to contradict
> POSIX which suggests exiting with failure if anything output to stderr.
> A compromise might be to just replace invalid chars with
> the replacement character � and then include that in
> normal character processing, to make issues in input apparent.


Since there are several plausible error-handling methods (silently
discard invalid input, flag input as invalid with an error and no
further output, or convert invalid input into the replacement character
and proceed with output), each of which is desirable in some
circumstances, I wonder if we should give ALL utilities a common
--encoding-error=POLICY option that allows runtime selection among the
three policies, and/or an environment variable that selects the default
policy in the absence of a command-line choice.
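
To make the idea concrete, here is a rough sketch of the selection logic,
in Python rather than C purely for brevity. Everything named here is an
assumption for illustration: the policy names (discard, fail, replace),
the ENCODING_ERROR environment variable, and the fallback default of
"replace" are all hypothetical, and nothing like this exists in any
utility today.

```python
import os

# Hypothetical policy names mapped onto Python's built-in codec error
# handlers. The proposal does not name the policies; these are assumed.
POLICIES = {
    "discard": "ignore",   # silently drop invalid bytes
    "fail": "strict",      # raise an error, produce no further output
    "replace": "replace",  # substitute U+FFFD and carry on
}


def decode_with_policy(data, policy=None, encoding="utf-8"):
    """Decode bytes under the selected encoding-error policy.

    An explicit policy (as from a command-line option) wins; otherwise the
    hypothetical ENCODING_ERROR environment variable supplies the default,
    falling back to "replace" (an assumed default, not a decided one).
    """
    if policy is None:
        policy = os.environ.get("ENCODING_ERROR", "replace")
    if policy not in POLICIES:
        raise ValueError("unknown encoding-error policy: %r" % (policy,))
    return data.decode(encoding, errors=POLICIES[policy])
```

With the input b"a\xffb", the "discard" policy yields "ab", "replace"
yields "a", U+FFFD, "b", and "fail" raises UnicodeDecodeError. The point
is only that a command-line choice overrides the environment, which in
turn overrides the built-in default.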

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
