On Wed, Nov 21, 2012 at 10:50 PM, Glenn Fowler <[email protected]> wrote:
> On Wed, 21 Nov 2012 15:02:16 +0100 Roland Mainz wrote:
>
>> On Sun, Jul 15, 2012 at 4:07 AM, Roland Mainz <[email protected]> wrote:
>> > On Fri, Jul 13, 2012 at 3:58 PM, David Korn <[email protected]> wrote:
>> >> cc:  [email protected]
>> >> Subject: Re: [ast-developers] RFE: New "wc" option "-X" which counts
>> >> number of bytes which do not constitute valid multibyte characters...
>> >>
>> >> Would -X automatically enable -c?
>> >
>> > Erm... -c is for plain bytes while -X should just count the number of
>> > bytes not covered by -m/-C.
>> >
>> > AFAIK all "wc" counting options count independently (e.g. -c/-m/-w/-l
>> > can all be used in one command line, at least with GNU "wc"; AST
>> > "wc" doesn't like having both -c and -m on the same command line) ...
>> > -X would be an exception because it basically "feeds on the remainder"
>> > of -m/-C ...
>> >
>> >> Would the output contain the other count or just the invalid character 
>> >> count?
>> >
>> > It is the count of _bytes_ which do not make a valid multibyte
>> > character (technically it can happen in the "C"/"POSIX" locales,
>> > too... since both only cover bytes 0-127... making 128-255 invalid
>> > character values).
>
>> Attached (as "wc_count_invalidchars001.diff.txt") is a prototype patch
>> which implements wc -X to count invalid (multibyte) characters.
>> A possible testcase would look like this (erm... is the "7" correct ?):
>> -- snip --
>> $ LC_ALL=en_US.UTF-8 ~/bin/ksh -c 'builtin wc ; printf "a\xe1kkkk\xe2xLl\n" | wc -m -X -q'
>>        7       2
>> -- snip --
>
>> Notes:
>> - -X currently only works with -m/-C, i.e. when characters (not bytes)
>> are being counted. -X could work with -c (=print byte count) if
>> -m/-C can be enabled internally, too. This may be useful even in
>> single-byte locales since functions like |mbtowc()| should AFAIK
>> complain in cases when a byte does not represent a valid character
>> value (I'll test this later today)
>> - It would be nice (assuming the POSIX/SUS standards allow it) if
>> both -c and -m/-C could be enabled at the same time. Is there anything
>> which disallows this from the standard's side ?
>
> I had started almost the same patch last night
> a few problems in the invalid character logic were uncovered in the process

Yes... one thing I noticed was that $ wc -X -m -q ... # and $ wc -X -m
... # produced different results depending on where the bytes were
injected.
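
For reference, a minimal sketch of the counting being discussed, written
in Python rather than taken from the AST "wc" sources. It is not the
prototype patch: the function name is hypothetical, Python's codecs stand
in for |mbtowc()|, and it assumes that decoding resumes exactly one byte
after each invalid byte.

```python
from typing import Tuple


def count_chars_and_invalid_bytes(data: bytes, encoding: str = "utf-8") -> Tuple[int, int]:
    """Return (valid_characters, invalid_bytes) for data in the given encoding.

    Hypothetical helper for illustration only. Assumed policy: each byte
    that cannot start or continue a valid character counts as one invalid
    byte, and decoding restarts at the very next byte.
    """
    chars = 0
    invalid = 0
    pos = 0
    while pos < len(data):
        try:
            # The rest of the buffer decodes cleanly: count it and stop.
            chars += len(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice; everything before it is valid.
            chars += len(data[pos:pos + err.start].decode(encoding))
            invalid += 1                 # the offending byte itself
            pos += err.start + 1         # resynchronise one byte later
    return chars, invalid


if __name__ == "__main__":
    # Same bytes as the printf test string, decoded as UTF-8.
    print(count_chars_and_invalid_bytes(b"a\xe1kkkk\xe2xLl\n"))       # (9, 2)
    # 7-bit-only view, comparable to bytes 128-255 being invalid in "C"/"POSIX".
    print(count_chars_and_invalid_bytes(b"abc\xff", "ascii"))          # (3, 1)
```

With this resync policy the test string yields 9 characters and 2 invalid
bytes (9 + 2 = 11 input bytes). A policy which consumes more than one byte
after an invalid lead byte would give different counts, so the expected
numbers depend on which policy "wc" is meant to follow.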

> I added C.UTF-8 tests to src/cmd/builtin/wc.rt
> the next alpha should be posted shortly

Thanks... :-)

----

Bye,
Roland

P.S.: $ wc -X ... # can be used to test the iconv builtin and the
issue with $ typeset -L2 ... # vs. multibyte characters which occupy
more than one terminal cell (e.g. see
http://lists.research.att.com/pipermail/ast-developers/2012q4/002119.html
("Problem with typeset -L variables and multibyte characters wider
than one terminal cell...")).

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)