On Wed, 21 Nov 2012 15:02:16 +0100 Roland Mainz wrote:

> On Sun, Jul 15, 2012 at 4:07 AM, Roland Mainz <[email protected]> 
> wrote:
> > On Fri, Jul 13, 2012 at 3:58 PM, David Korn <[email protected]> wrote:
> >> cc:  [email protected]
> >> Subject: Re: [ast-developers] RFE: New "wc" option "-X" which counts 
> >> number of  bytes which do not constitute valid multibyte characters...
> >>
> >> Would -X automatically enable -c?
> >
> > Erm... -c is for plain bytes while -X should just count the number of
> > bytes not covered by -m/-C.
> >
> > AFAIK all "wc" counting options count independently, e.g. -c/-m/-w/-l
> > can all be used on one command line (at least with GNU "wc"; AST
> > "wc" doesn't like having both -c and -m on the same command line).
> > -X would be an exception because it basically "feeds on the remainder"
> > of -m/-C.
> >
> >> Would the output contain the other count or just the invalid character 
> >> count?
> >
> > It is the count of _bytes_ which do not make a valid multibyte
> > character (technically it can happen in the "C"/"POSIX" locales,
> > too... since both only cover bytes 0-127... making 128-255 invalid
> > character values).

> Attached (as "wc_count_invalidchars001.diff.txt") is a prototype patch
> which implements wc -X to count invalid (multibyte) characters.
> A possible testcase would look like this (erm... is the "7" correct ?):
> -- snip --
> $ LC_ALL=en_US.UTF-8 ~/bin/ksh -c 'builtin wc ; printf "a\xe1kkkk\xe2xLl\n" | wc -m -X -q'
>        7       2
> -- snip --
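
For comparison, here is a minimal Python sketch of one possible counting convention for the test input above. It assumes strict UTF-8 and byte-by-byte resynchronisation (each undecodable byte is counted on its own and decoding resumes at the next byte). The function name is hypothetical and this is not the logic of the attached patch.

```python
# Sketch only: byte-by-byte resynchronisation, not the AST "wc" logic.
def count_chars_and_invalid_bytes(data: bytes, encoding: str = "utf-8"):
    """Return (valid_characters, invalid_bytes) for data."""
    # "surrogateescape" maps every undecodable byte to one lone
    # surrogate in U+DC80..U+DCFF, so invalid bytes can be counted 1:1.
    text = data.decode(encoding, errors="surrogateescape")
    invalid = sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
    return len(text) - invalid, invalid


sample = b"a\xe1kkkk\xe2xLl\n"  # same bytes as the printf test above
chars, invalid = count_chars_and_invalid_bytes(sample)
print(len(sample), chars, invalid)  # bytes, characters, invalid bytes
```

Under these assumptions the sample has 11 bytes, 9 valid characters and 2 invalid bytes. An implementation that resynchronises differently after a bad lead byte would report a different character count, so this only pins down the "2".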

> Notes:
> - -X currently only works with -m/-C, i.e. when characters (not bytes)
> are being counted. -X could work with -c (=print byte count), too, if
> -m/-C can be enabled internally. This may be useful even in
> single-byte locales since functions like |mbtowc()| should AFAIK
> complain when a byte does not represent a valid character
> value (I'll test this later today).
> - It would be nice (assuming the POSIX/SUS standards allow it) if
> both -c and -m/-C could be enabled at the same time. Is there anything
> which disallows this from the standard's side ?
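
The two notes above can be illustrated with a small Python sketch that produces the byte count (-c), the character count (-m/-C) and the invalid-byte count (-X) from a single decode. It uses the "ascii" codec as a stand-in for a locale that only covers bytes 0-127. That stand-in is an assumption: whether a real |mbtowc()| rejects bytes 128-255 in the "C"/"POSIX" locale is exactly what still needs testing. The function name is hypothetical.

```python
# Sketch only: models the three counts, not the AST "wc" implementation.
def wc_counts(data: bytes, encoding: str):
    """Return (byte_count, character_count, invalid_byte_count)."""
    # Each undecodable byte becomes one lone surrogate U+DC80..U+DCFF.
    text = data.decode(encoding, errors="surrogateescape")
    invalid = sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
    return len(data), len(text) - invalid, invalid


data = b"na\xc3\xafve \xe2\x82\xac\n"  # UTF-8 text with two non-ASCII characters

print(wc_counts(data, "utf-8"))  # multibyte locale model
print(wc_counts(data, "ascii"))  # 7-bit model: bytes 128-255 are invalid
```

With this input the UTF-8 model gives (11, 8, 0) and the 7-bit model gives (11, 6, 5). The byte count is the same in both, which is why -c and -m/-C do not conflict in such a single pass.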

I had started almost the same patch last night.
A few problems in the invalid character logic were uncovered in the process.
I added C.UTF-8 tests to src/cmd/builtin/wc.rt.
The next alpha should be posted shortly.

_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers