On Wed, 21 Nov 2012 23:34:20 +0100 Roland Mainz wrote:
> On Wed, Nov 21, 2012 at 10:50 PM, Glenn Fowler <[email protected]> wrote:
> > On Wed, 21 Nov 2012 15:02:16 +0100 Roland Mainz wrote:
> >> On Sun, Jul 15, 2012 at 4:07 AM, Roland Mainz <[email protected]>
> >> wrote:
> >> > On Fri, Jul 13, 2012 at 3:58 PM, David Korn <[email protected]>
> >> > wrote:
> >> >> cc: [email protected]
> >> >> Subject: Re: [ast-developers] RFE: New "wc" option "-X" which counts
> >> >> number of bytes which do not constitute valid multibyte characters...
> >> >>
> >> >> Would -X automatically enable -c?
> >> >
> >> > Erm... -c is for plain bytes while -X should just count the number of
> >> > bytes not covered by -m/-C.
> >> >
> >> > AFAIK all "wc" counting options count independently (e.g. -c/-m/-w/-l
> >> > can all be used in one command line (at least with GNU "wc"... AST
> >> > "wc" doesn't like having both -c and -m at the same command line)) ...
> >> > -X would be an exception because it basically "feeds on the remainder"
> >> > of -m/-C ...
> >> >
> >> >> Would the output contain the other count or just the invalid character
> >> >> count?
> >> >
> >> > It is the count of _bytes_ which do not make a valid multibyte
> >> > character (technically it can happen in the "C"/"POSIX" locales,
> >> > too... since both only cover bytes 0-127... making 128-255 invalid
> >> > character values).
> >
> >> Attached (as "wc_count_invalidchars001.diff.txt") is a prototype patch
> >> which implements wc -X to count invalid (multibyte) characters.
> >> A possible testcase would look like this (erm... is the "7" correct ?):
> >> -- snip --
> >> $ LC_ALL=en_US.UTF-8 ~/bin/ksh -c 'builtin wc ; printf "a\xe1kkkk\xe2xLl\n" | wc -m -X -q'
> >> 7 2
> >> -- snip --
> >
> >> Notes:
> >> - -X currently only works with -m/-C, i.e. when characters (not bytes)
> >> are being counted.
> >> -X could work with -c (=print byte count) when
> >> -m/-C can be enabled internally, too. This may be useful even in
> >> single-byte locales since functions like |mbtowc()| should AFAIK
> >> complain in cases when a byte does not represent a valid character
> >> value (I'll test this later today)
> >> - It would be nice if (assuming the POSIX/SUS standards allow it)
> >> both -c and -m/-C could be enabled at the same time. Is there anything
> >> which disallows this from the standard's side ?
> >
> > I had started almost the same patch last night
> > a few problems in the invalid character logic were uncovered in the process
> Yes... one thing I noticed was that $ wc -X -m -q ... # and $ wc -X -m
> ... # produced different results depending on where the bytes were
> injected.
> > I added C.UTF-8 tests to src/cmd/builtin/wc.rt
> > the next alpha should be posted shortly
> Thanks... :-)
> ----
> Bye,
> Roland
> P.S.: $ wc -X ... # can be used to test the iconv builtin and the
> issue with $ typeset -L2 ... # vs. multibyte characters which occupy
> more than one terminal cell (e.g. see
> http://lists.research.att.com/pipermail/ast-developers/2012q4/002119.html
> ("Problem with typeset -L variables and multibyte characters wider
> than one terminal cell...")).

yes, I haven't wrapped my brain completely around that one yet
we have to be careful to differentiate #bytes vs #chars vs #print-widths
typeset -L mentions "field width" and that probably means print-width
which means that the -L code may need a review to make sure it measures
print-widths and truncates by character rather than byte
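
To make the #bytes vs #chars vs #print-widths distinction concrete, here is a rough sketch in Python (not ksh or the AST C code). The helper names `cell_width`, `measures` and `left_justify` are hypothetical, and the width rule is only an approximation of what wcwidth() would report; it is not a description of how typeset -L actually behaves.

```python
import unicodedata

def cell_width(ch: str) -> int:
    """Approximate terminal cells for one character (stand-in for wcwidth())."""
    if unicodedata.combining(ch):
        return 0  # combining marks occupy no cell of their own
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide / fullwidth characters occupy two cells
    return 1

def measures(s: str, encoding: str = "utf-8"):
    """Return (#bytes, #chars, #print-width) for s."""
    return (len(s.encode(encoding)), len(s), sum(cell_width(c) for c in s))

def left_justify(s: str, width: int) -> str:
    """Measure in cells, cut only at character boundaries, pad with spaces."""
    out, used = [], 0
    for ch in s:
        w = cell_width(ch)
        if used + w > width:
            break  # the next character would overflow the field
        out.append(ch)
        used += w
    return "".join(out) + " " * (width - used)

if __name__ == "__main__":
    print(measures("abc"))
    print(measures("日本語"))
    print(repr(left_justify("日本語", 3)))
```

For "日本語" the three measures are 9 bytes, 3 characters and 6 cells, so a field of 3 cells holds exactly one character plus one padding space. Truncating the same string at 3 bytes or 3 characters would give different (and for a terminal field, wrong) results.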

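
A minimal sketch of the counting rule discussed in the thread (characters as with -m, plus the bytes that do not form a valid character), written in Python rather than the C of the actual wc patch. It relies on Python's `surrogateescape` error handler, which maps each undecodable byte to one lone surrogate in U+DC80..U+DCFF. The function name `count_chars_and_invalid` is made up for illustration, and none of this is taken from the AST wc prototype.

```python
def count_chars_and_invalid(data: bytes, encoding: str = "utf-8"):
    """Return (valid_characters, invalid_bytes) for data in the given encoding.

    Assumption of this sketch: every undecodable byte counts once and
    decoding resumes at the following byte.
    """
    # surrogateescape turns each undecodable byte into one lone surrogate
    text = data.decode(encoding, errors="surrogateescape")
    invalid = sum(1 for ch in text if "\udc80" <= ch <= "\udcff")
    return len(text) - invalid, invalid

if __name__ == "__main__":
    # The bytes written in the testcase from the thread
    sample = b"a\xe1kkkk\xe2xLl\n"
    print(count_chars_and_invalid(sample))
    # "ascii" stands in for the "C"/"POSIX" locales: bytes 128-255 are invalid
    print(count_chars_and_invalid(b"ab\xff", "ascii"))
```

Under this convention the sample bytes come out as 9 characters and 2 invalid bytes. That is only what this sketch's resynchronisation rule yields; other rules are possible, so it does not by itself settle what wc -m -X ought to print.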