On Wed, Nov 21, 2012 at 10:50 PM, Glenn Fowler <[email protected]> wrote:
> On Wed, 21 Nov 2012 15:02:16 +0100 Roland Mainz wrote:
>
>> On Sun, Jul 15, 2012 at 4:07 AM, Roland Mainz <[email protected]>
>> wrote:
>> > On Fri, Jul 13, 2012 at 3:58 PM, David Korn <[email protected]> wrote:
>> >> cc: [email protected]
>> >> Subject: Re: [ast-developers] RFE: New "wc" option "-X" which counts
>> >> number of bytes which do not constitute valid multibyte characters...
>> >>
>> >> Would -X automatically enable -c?
>> >
>> > Erm... -c is for plain bytes while -X should just count the number of
>> > bytes not covered by -m/-C.
>> >
>> > AFAIK all "wc" counting options count independently (e.g. -c/-m/-w/-l
>> > can all be used in one command line (at least with GNU "wc"... AST
>> > "wc" doesn't like having both -c and -m at the same command line)) ...
>> > -X would be an exception because it basically "feeds on the remainder"
>> > of -m/-C ...
>> >
>> >> Would the output contain the other count or just the invalid character
>> >> count?
>> >
>> > It is the count of _bytes_ which do not make a valid multibyte
>> > character (technically it can happen in the "C"/"POSIX" locales,
>> > too... since both only cover bytes 0-127... making 128-255 invalid
>> > character values).
>
>> Attached (as "wc_count_invalidchars001.diff.txt") is a prototype patch
>> which implements wc -X to count invalid (multibyte) characters.
>> A possible testcase would look like this (erm... is the "7" correct ?):
>> -- snip --
>> $ LC_ALL=en_US.UTF-8 ~/bin/ksh -c 'builtin wc ; printf "a\xe1kkkk\xe2xLl\n" | wc -m -X -q'
>> 7 2
>> -- snip --
>
>> Notes:
>> - -X currently only works with -m/-C, i.e. when characters (not bytes)
>> are being counted. -X could work with -c (=print byte count) when
>> -m/-C can be enabled internally, too. This may be useful even in
>> single-byte locales since functions like |mbtowc()| should AFAIK
>> complain in cases when a byte does not represent a valid character
>> value (I'll test this later today)
>> - It would be nice if (assuming the POSIX/SUS standards allow it)
>> both -c and -m/-C can be enabled at the same time. Is there anything
>> which disallows this from the standard's side ?
>
> I had started almost the same patch last night
> a few problems in the invalid character logic were uncovered in the process

Yes... one thing I noticed was that
$ wc -X -m -q ... # and
$ wc -X -m ... #
produced different results depending on where the bytes were injected.

> I added C.UTF-8 tests to src/cmd/builtin/wc.rt
> the next alpha should be posted shortly

Thanks... :-)

----

Bye,
Roland

P.S.: $ wc -X ... # can be used to test the iconv builtin and the
issue with $ typeset -L2 ... # vs. multibyte characters which occupy
more than one terminal cell (e.g. see
http://lists.research.att.com/pipermail/ast-developers/2012q4/002119.html
("Problem with typeset -L variables and multibyte characters wider
than one terminal cell...")).

--
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
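
For illustration only, the "-X feeds on the remainder of -m/-C" idea from
the quoted discussion can be sketched in plain C. This is NOT the prototype
patch and not the AST "wc" code: the names (struct mb_counts, utf8_seq_len,
count_utf8) are made up for this sketch, UTF-8 is hard-coded instead of
going through the locale via |mbtowc()|/|mbrtowc()|, and the recovery policy
(count exactly one invalid byte, then resume decoding at the next byte) is
an assumption. A different policy would give different numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch (not from the AST sources). */
struct mb_counts {
    size_t chars;   /* well-formed characters, roughly what -m/-C counts */
    size_t invalid; /* bytes that are not part of any well-formed character */
};

/*
 * Return the length (1..4) of the well-formed UTF-8 sequence starting at s,
 * or 0 if the bytes at s do not start one. n is the number of bytes left.
 */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                   /* plain ASCII */
    if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */
    if (len > n)
        return 0;               /* sequence truncated by end of input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* expected a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }
    /* Reject overlong encodings, UTF-16 surrogates and values > U+10FFFF. */
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000))
        return 0;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    return len;
}

/*
 * Count characters and invalid bytes in buf[0..n-1].
 * Assumed policy: an undecodable position costs exactly one invalid byte.
 */
struct mb_counts count_utf8(const unsigned char *buf, size_t n)
{
    struct mb_counts c = { 0, 0 };
    size_t i = 0;

    assert(buf != NULL || n == 0);
    while (i < n) {
        size_t len = utf8_seq_len(buf + i, n - i);
        if (len == 0) {
            c.invalid++;        /* the "remainder" -X would report */
            i++;
        } else {
            c.chars++;
            i += len;
        }
    }
    return c;
}
```

Under this particular policy the 11-byte test string from the thread
("a\xe1kkkk\xe2xLl\n") gives 9 characters and 2 invalid bytes, and
chars plus invalid always accounts for every input byte when the input is
ASCII plus stray bytes. Whether the prototype's "7" is right depends on how
it resynchronizes after a bad lead byte, which this sketch does not try to
reproduce.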
