Re: [ast-developers] RFE: New "wc" option "-X" which counts number of bytes which do not constitute valid multibyte characters...

Glenn Fowler Wed, 21 Nov 2012 14:43:24 -0800

On Wed, 21 Nov 2012 23:34:20 +0100 Roland Mainz wrote:
> On Wed, Nov 21, 2012 at 10:50 PM, Glenn Fowler <[email protected]> wrote:
> > On Wed, 21 Nov 2012 15:02:16 +0100 Roland Mainz wrote:
> >
> >> On Sun, Jul 15, 2012 at 4:07 AM, Roland Mainz <[email protected]> 
> >> wrote:
> >> > On Fri, Jul 13, 2012 at 3:58 PM, David Korn <[email protected]> 
> >> > wrote:
> >> >> cc:  [email protected]
> >> >> Subject: Re: [ast-developers] RFE: New "wc" option "-X" which counts 
> >> >> number of  bytes which do not constitute valid multibyte characters...
> >> >>
> >> >> Would -X automatically enable -c?
> >> >
> >> > Erm... -c is for plain bytes while -X should just count the number of
> >> > bytes not covered by -m/-C.
> >> >
> >> > AFAIK all "wc" counting options count independently (e.g. -c/-m/-w/-l
> >> > can all be used in one command line (at least with GNU "wc"... AST
> >> > "wc" doesn't like having both -c and -m at the same command line)) ...
> >> > -X would be an exception because it basically "feeds on the remainder"
> >> > of -m/-C ...
> >> >
> >> >> Would the output contain the other count or just the invalid character 
> >> >> count?
> >> >
> >> > It is the count of _bytes_ which do not make a valid multibyte
> >> > character (technically it can happen in the "C"/"POSIX" locales,
> >> > too... since both only cover bytes 0-127... making 128-255 invalid
> >> > character values).
> >
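
As a rough illustration of the "C"/"POSIX" case described above: if the locale's charset covers only bytes 0-127, the invalid-byte count reduces to the number of bytes above 127. A minimal Python sketch follows; the function name is hypothetical and this is not the AST "wc" code.

```python
def count_non_ascii_bytes(data: bytes) -> int:
    """Count bytes above 127.

    Assumption for this sketch: the charset covers only 0-127, so every
    byte in 128-255 has no valid character value.
    """
    return sum(1 for b in data if b > 0x7F)


# Same byte string as the printf test input in this thread.
print(count_non_ascii_bytes(b"a\xe1kkkk\xe2xLl\n"))  # prints 2
```
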
> >> Attached (as "wc_count_invalidchars001.diff.txt") is a prototype patch
> >> which implements wc -X to count invalid (multibyte) characters.
> >> A possible testcase would look like this (erm... is the "7" correct ?):
> >> -- snip --
> >> $ LC_ALL=en_US.UTF-8 ~/bin/ksh -c 'builtin wc ; printf "a\xe1kkkk\xe2xLl\n" | wc -m -X -q'
> >>        7       2
> >> -- snip --
> >
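
For comparison with the testcase above, here is a small Python sketch that counts valid UTF-8 characters and invalid bytes for the same input. It assumes one particular policy: every byte that cannot start or continue a valid sequence is counted on its own, and decoding resumes at the very next byte. The function name is hypothetical, and it relies on Python's "surrogateescape" error handler rather than on mbtowc(), so it is only a model of what -m/-X might report.

```python
def count_chars_and_invalid_bytes(data: bytes, encoding: str = "utf-8"):
    """Return (valid_characters, invalid_bytes) for data.

    "surrogateescape" turns each undecodable byte into one lone
    surrogate in U+DC80..U+DCFF, so counting those gives the number of
    invalid bytes; everything else is a validly decoded character.
    """
    text = data.decode(encoding, errors="surrogateescape")
    invalid = sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)
    return len(text) - invalid, invalid


chars, invalid = count_chars_and_invalid_bytes(b"a\xe1kkkk\xe2xLl\n")
print(chars, invalid)  # prints: 9 2
```

Under this policy the 11 input bytes split into 9 characters and 2 invalid bytes. A decoder that swallows a following byte together with each bad lead byte would report fewer characters, which may be relevant to the "7" questioned above.
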
> >> Notes:
> >> - -X currently only works with -m/-C, i.e. when characters (not bytes)
> >> are being counted. -X could work with -c (=print byte count) when
> >> -m/-C can be enabled internally, too. This may be useful even in
> >> single-byte locales since functions like |mbtowc()| should AFAIK
> >> complain in cases when a byte does not represent a valid character
> >> value (I'll test this later today)
> >> - It would be nice (assuming the POSIX/SUS standards allow it) if
> >> both -c and -m/-C could be enabled at the same time. Is there anything
> >> which disallows this from the standard's side ?
> >
> > I had started almost the same patch last night
> > a few problems in the invalid character logic were uncovered in the process


> Yes... one thing I noticed was that $ wc -X -m -q ... # and $ wc -X -m
> ... # produced different results depending on where the bytes were
> injected.

> > I added C.UTF-8 tests to src/cmd/builtin/wc.rt
> > the next alpha should be posted shortly

> Thanks... :-)

> ----

> Bye,
> Roland

> P.S.: $ wc -X ... # can be used to test the iconv builtin and the
> issue with $ typeset -L2 ... # vs. multibyte characters which occupy
> more than one terminal cell (e.g. see
> http://lists.research.att.com/pipermail/ast-developers/2012q4/002119.html
> ("Problem with typeset -L variables and multibyte characters wider
> than one terminal cell...")).

yes, I haven't wrapped my brain completely around that one yet
we have to be careful to differentiate #bytes vs #chars vs #print-widths
typeset -L mentions "field width" and that probably means print-width
which means that the -L code may need a review to make sure it measures
print-widths and truncates by character rather than byte

