On Wed, 21 Nov 2012 15:02:16 +0100 Roland Mainz wrote:
> On Sun, Jul 15, 2012 at 4:07 AM, Roland Mainz <[email protected]> wrote:
> > On Fri, Jul 13, 2012 at 3:58 PM, David Korn <[email protected]> wrote:
> >> cc: [email protected]
> >> Subject: Re: [ast-developers] RFE: New "wc" option "-X" which counts
> >> number of bytes which do not constitute valid multibyte characters...
> >>
> >> Would -X automatically enable -c?
> >
> > Erm... -c is for plain bytes while -X should just count the number of
> > bytes not covered by -m/-C.
> >
> > AFAIK all "wc" counting options count independently (e.g. -c/-m/-w/-l
> > can all be used in one command line (at least with GNU "wc"... AST
> > "wc" doesn't like having both -c and -m at the same command line)) ...
> > -X would be an exception because it basically "feeds on the remainder"
> > of -m/-C ...
> >
> >> Would the output contain the other count or just the invalid character
> >> count?
> >
> > It is the count of _bytes_ which do not make a valid multibyte
> > character (technically it can happen in the "C"/"POSIX" locales,
> > too... since both only cover bytes 0-127... making 128-255 invalid
> > character values).
>
> Attached (as "wc_count_invalidchars001.diff.txt") is a prototype patch
> which implements wc -X to count invalid (multibyte) characters.
>
> A possible testcase would look like this (erm... is the "7" correct ?):
> -- snip --
> $ LC_ALL=en_US.UTF-8 ~/bin/ksh -c 'builtin wc ; printf "a\xe1kkkk\xe2xLl\n" | wc -m -X -q'
> 7 2
> -- snip --
>
> Notes:
> - -X currently only works with -m/-C, i.e. when characters (not bytes)
>   are being counted. -X could work with -c (=print byte count) when
>   -m/-C can be enabled internally, too. This may be useful even in
>   single-byte locales since functions like |mbtowc()| should AFAIK
>   complain in cases when a byte does not represent a valid character
>   value (I'll test this later today)
> - It would be nice (assuming the POSIX/SUS standards allow it) if
>   both -c and -m/-C could be enabled at the same time. Is there anything
>   which disallows this from the standard's side ?

I had started almost the same patch last night.
A few problems in the invalid character logic were uncovered in the process.
I added C.UTF-8 tests to src/cmd/builtin/wc.rt.
The next alpha should be posted shortly.
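
For reference, the counting rule discussed above (characters for -m/-C,
leftover bytes for -X) can be sketched in C roughly as follows. This is
only an illustration under stated assumptions, not the attached patch and
not the AST "wc" code: it hard-codes UTF-8 instead of asking the current
locale via |mbtowc()|, it counts each byte that does not begin a
well-formed sequence as one invalid byte and then resynchronizes at the
very next byte, and the names |struct mb_counts|, |utf8_seq_len()| and
|count_utf8()| are made up for this example.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of AST wc. */
struct mb_counts {
    size_t chars;         /* well-formed UTF-8 characters (what -m/-C would count) */
    size_t invalid_bytes; /* bytes outside any well-formed character (the proposed -X) */
};

/*
 * Return the length (1..4) of the well-formed UTF-8 sequence starting at p,
 * or 0 if the n available bytes at p do not begin one. Requires n >= 1.
 */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char b0 = p[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the second byte */
    size_t len, i;

    if (b0 < 0x80)
        return 1;                       /* plain ASCII */
    else if (b0 >= 0xC2 && b0 <= 0xDF)
        len = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;      /* reject overlong encodings */
        if (b0 == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;      /* reject overlong encodings */
        if (b0 == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else
        return 0;                       /* stray continuation byte, 0xC0/0xC1, 0xF5..0xFF */

    if (n < len)
        return 0;                       /* sequence truncated at end of input */
    if (p[1] < lo || p[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (p[i] < 0x80 || p[i] > 0xBF)
            return 0;
    return len;
}

/* Count characters and invalid bytes in buf[0..n-1]. */
struct mb_counts count_utf8(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    struct mb_counts c = { 0, 0 };
    size_t i = 0;

    while (i < n) {
        size_t len = utf8_seq_len(p + i, n - i);
        if (len == 0) {
            c.invalid_bytes++;          /* skip exactly one byte and resynchronize */
            i++;
        } else {
            c.chars++;
            i += len;
        }
    }
    return c;
}
```

With this particular rule, the 11-byte test input "a\xe1kkkk\xe2xLl\n"
gives 9 characters and 2 invalid bytes. A different resynchronization
policy (for example one that swallows the bytes following a bad lead
byte) would report a different character count.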
