On Thu, 19 Sep 2013 00:19:11 +0200 Cedric Blancher wrote:
> On 5 August 2013 21:35, Cedric Blancher <[email protected]> wrote:
> > On 22 July 2013 16:28, Glenn Fowler <[email protected]> wrote:
> >>
> >> On Mon, 22 Jul 2013 12:10:32 +0200 Cedric Blancher wrote:
> >>> On 10 June 2013 03:50, Glenn Fowler <[email protected]> wrote:
> >>> >
> >>> > On Mon, 10 Jun 2013 03:47:08 +0200 Roland Mainz wrote:
> >>> >> On Sun, Jun 9, 2013 at 4:44 AM, Glenn Fowler <[email protected]> wrote:
> >>> >> > I knew I would get into semantic trouble here
> >>> >> > I'm not complaining/deriding the efficacy of iswrune()
> >>> >> > only that it has no bearing on any posix compliant utility
> >>> >
> >>> >> OK... here is the question which bothers me:
> >>> >> tr -C does require to sort characters, right ? How do we sort
> >>> >> characters which do not have an assigned meaning ?
> >>> >
> >>> > strcoll()
> >>> >
> >>> >> > if anyone wants to start a discussion about new utility option(s)
> >>> >> > that rely on iswrune() and what ast utilities should be affected, 
> >>> >> > great
> >>> >> >
> >>> >> > for systems that do not supply iswrune() portability remains a big 
> >>> >> > issue,
> >>> >> > current practice notwithstanding -- it will always be an
> >>> >> > iffe|config game of catchup vs. the iw*() collection du jour
> >>> >
> >>> >> BTW: re |iswrune()| emulation... perl has the perl regex match
> >>> >> \p{Unassigned} ... which creates the same matches as this script
> >>> >> (assuming LC_ALL='en_US.UTF-8' and locales Unicode version matches the
> >>> >> perl unicode version):
> >>> >> -- snip --
> >>> >> set -o nounset
> >>> >
> >>> >> typeset -i16 i
> >>> >
> >>> >> for (( i=0 ; i < 0x10FFFF ; i++ )) ; do
> >>> >>       ch="${ printf "\u[${i/~(El)16#/}]" ; }"
> >>> >
> >>> >>       if [[ "$ch" !=
> >>> >> ~(Elr)[[:alpha:][:alnum:][:digit:][:print:][:cntrl:][:space:][:blank:][:punct:]]
> >>> >> ]] ; then
> >>> >>               printf "# match found: %q\n" "${i}"
> >>> >>       fi
> >>> >> done
> >>> >
> >>> >> print '# done.'
> >>> >> -- snip --
> >>> >
> >>> >> |iswrune()| or not... IMO it would be nice to have something like
> >>> >> \p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
> >>> >> [:_unassigned:] character class...
> >>> >
> >>> > [:rune:] would be a fine name for that class
> >>
> >>> There's still no [:rune:] emulation in libast :(
> >>
> >> that looks simple enough
> >> but I'm not convinced it's correct
> >> what about system and user defined classes
> >> (there are notes on the list about some for chinese characters -- I forget 
> >> the details)
> >
> > Maybe Roland can elaborate. He's an expert for such locales.
> >
> >> if those aren't handled then why provide a [:rune:] that might work maybe
> >
> > Chinese and Japanese locales have extra classes defined by the locale
> > data, but they are *always* "extra", i.e. the characters have matches
> > in the basic POSIX character classes but also match extra classes like
> > isphonogram() or isideogram().

ast regex already handles the extra classes via the posix wctype() and 
iswctype() apis
if posix adds a "rune" class then ast will just work
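
A minimal sketch of the by-name lookup described above, not the actual
libast code. named_class_match() is a hypothetical helper. The point is that
the caller never needs a built-in list of class names: whatever the current
LC_CTYPE locale defines is reachable through wctype(), so a locale-supplied
class needs no special casing.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not libast code: test wc against a class known
 * only by name. Returns 1 on match, 0 on no match, and -1 when the
 * current locale does not define a class with that name.
 */
int named_class_match(const char *name, wint_t wc)
{
    wctype_t type;

    assert(name != NULL);
    type = wctype(name);        /* 0 means "no such class in this locale" */
    if (type == (wctype_t)0)
        return -1;
    return iswctype(wc, type) != 0;
}
```

A bracket expression such as [[:digit:]] would then reduce to
named_class_match("digit", wc), and a class name the locale does not know
is reported as unknown instead of silently matching nothing.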

> > Please, could we get [:rune:] and a --weed-out-non-runes option for
> > tr(1), please?
> >

> Please?

I still don't know how the proposed rune interacts with codesets vs languages
note that all posix mb* and wc* apis deal with codesets independent of the
language
where is the oracle that says "this is a rune" and what are its input parameters
and does it vary by language X codeset or just by codeset
and how does one track when the oracle changes its mind, or a language changes
its mind, or when implementations differ in which codepoints are represented

propose how to provide a wctype() and iswctype() like api for "rune" that ast
could use as an intercept in src/lib/libast/regex/regclass.c
and then [[:rune:]] will be visible everywhere in ast
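
One possible shape for such an intercept, as a sketch only. The names
rune_wctype_t, rune_wctype() and rune_iswctype() are hypothetical and do not
exist in libast or POSIX. It assumes "rune" can be emulated as the union of
the POSIX classes tested in the ksh script quoted earlier in this thread,
which is exactly the assumption questioned above, and it says nothing about
the language vs codeset question.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical handle: either the emulated "rune" class or a libc class. */
typedef struct {
    int      is_rune;   /* nonzero: emulated "rune" class */
    wctype_t native;    /* libc handle otherwise; 0 when the name is unknown */
} rune_wctype_t;

/* Classes from the quoted ksh script; their union is treated as "rune". */
static const char *const rune_parts[] = {
    "alpha", "alnum", "digit", "print", "cntrl", "space", "blank", "punct"
};

/* wctype()-like lookup that intercepts the name "rune". */
rune_wctype_t rune_wctype(const char *name)
{
    rune_wctype_t t = { 0, (wctype_t)0 };

    assert(name != NULL);
    if (strcmp(name, "rune") == 0)
        t.is_rune = 1;
    else
        t.native = wctype(name);    /* everything else goes to libc */
    return t;
}

/* iswctype()-like test; returns nonzero when wc is in the class. */
int rune_iswctype(wint_t wc, rune_wctype_t t)
{
    size_t i;

    if (!t.is_rune)
        return iswctype(wc, t.native);  /* yields 0 when native is 0 */
    /* Look the parts up on every call so a locale change is honored. */
    for (i = 0; i < sizeof(rune_parts) / sizeof(rune_parts[0]); i++)
        if (iswctype(wc, wctype(rune_parts[i])))
            return 1;
    return 0;
}
```

Keeping the handle in a small struct avoids reserving a magic wctype_t value
for "rune", since wctype_t is opaque and any value could be one libc hands out.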

_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
