On Sun, Jun 9, 2013 at 4:44 AM, Glenn Fowler <[email protected]> wrote:
> On Sat, 8 Jun 2013 23:17:08 +0200 Roland Mainz wrote:
>> On Wed, Jun 5, 2013 at 3:52 PM, Glenn Fowler <[email protected]> wrote:
>> > On Wed, 5 Jun 2013 12:51:18 +0200 Cedric Blancher wrote:
>> >> On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
>> >> > I had posed a question to the posix austin group related to this
>> >> > and failed to report back to ast-developers
[snip]
>> > by "outside the scope" I meant that, within the scope of posix and what it
>> > demands for compliance, "invalid codepoint" is not mentioned
>
>> Erm... at least for Unicode and GB18030 the issue is not "invalid
>> codepoint" ... it's "unassigned codepoint". The codepoint itself may
>> be valid but has no assigned meaning... which also makes it
>> "unsortable" ... which was AFAIK the FreeBSD rationale behind
>> filtering unassigned codepoints out (the other issue is that "sorting"
>> Unicode characters via |strxfrm()| is tricky in this case since unless
>> the locale has defined a specific "sort order" the characters are
>> sorted using their numeric codepoint value... which sorts even
>> technically "unsortable" unassigned code points. Grrr...).
>
>> > the only place "codepoint" is mentioned is in the rationale for pax
>> > describing why they chose UTF-8 as the internal archive format codeset
>> > encoding - specifically because a pax archive used for interchange must
>> > be "codepoint" agnostic and encode all characters
>> > (rationales are not part of the standard proper)
>> >
>> >> PS: iswrune() is not specific to Unicode. It is used in the GBK and
>> >> JIS locales to distinguish GBK/JIS versions too.
>> >
>> > the "point" is that posix commands need only report "invalid character
>> > encoding" (EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is
>> > no requirement for any posix command that it report "invalid codepoint"
>
>> See above... s/invalid codepoint/unassigned codepoint/ ... |EILSEQ|
>> won't be returned unless the codepoint is beyond the numeric limit for
>> the matching Unicode standard...
>
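
A rough shell sketch of the "invalid encoding" vs. "unassigned codepoint"
split discussed above. Assumptions: iconv(1) is used as a stand-in for
mbrtowc() (a substitution made for illustration, not something from this
thread), valid_utf8 is a made-up helper name, and the converter behaves
like the glibc one. Under those assumptions the stray 0xFF byte should be
rejected as an illegal sequence, while the well-formed but unassigned
U+0378 should be accepted:

```shell
# Sketch only: iconv(1) stands in for mbrtowc(). Bytes are written in
# octal so that printf stays portable across shells.

# Hypothetical helper: succeeds if stdin is well-formed UTF-8.
valid_utf8() {
    iconv -f UTF-8 -t UTF-16LE >/dev/null 2>&1
}

# 0xFF can never appear in UTF-8: an invalid *encoding* (the EILSEQ case).
if printf '\377' | valid_utf8 ; then
    echo 'byte 0xFF: accepted'
else
    echo 'byte 0xFF: rejected'
fi

# U+0378 (bytes 0xCD 0xB8) is well-formed UTF-8, but the codepoint is
# unassigned in the Unicode versions known at the time of writing.
if printf '\315\270' | valid_utf8 ; then
    echo 'U+0378: accepted'
else
    echo 'U+0378: rejected'
fi
```

The converter only looks at the byte-level encoding, so it has no opinion
on whether a codepoint is assigned; that question needs a separate table
lookup such as iswrune().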
>> > it's nice that some implementations provide iswrune() to make it possible
>> > to portably determine "invalid codepoint", but that has no bearing on
>> > any posix compliant command implementation -- if any posix command
>> > implementation were to fail on "invalid codepoint" it would be
>> > non-compliant
>> >
>> > a command implementation could be extended via options to include
>> > "codepoint" diagnostics, but it would be an extension
>
>> Erm... AFAIK we don't need a "diagnostic" ... AFAIK the wish here
>> seems to be to "filter out" (maybe using an extra "tr" option) any
>> characters which do not match either |iswrune()| (if available) or all
>> of the |isw*()| functions defined by POSIX (maybe we shouldn't name
>> this class [:rune:] in regex... maybe a better name is
>> [:_posix_anychar:] ... leading '_' because it is non-standard (for
>> now) and "posix_anychar" to describe it should be true if it matches
>> any character class defined by POSIX).
[snip]
>
> I knew I would get into semantic trouble here
> I'm not complaining/deriding the efficacy of iswrune()
> only that it has no bearing on any posix compliant utility

OK... here is the question which bothers me:
tr -C does require sorting characters, right? How do we sort
characters which do not have an assigned meaning?

> if anyone wants to start a discussion about new utility option(s)
> that rely on iswrune() and what ast utilities should be affected, great
>
> for systems that do not supply iswrune() portability remains a big issue,
> current practice notwithstanding -- it will always be an
> iffe|config game of catch-up vs. the isw*() collection du jour

BTW: re |iswrune()| emulation... Perl has the regex property
\p{Unassigned} ... which produces the same matches as this script
(assuming LC_ALL='en_US.UTF-8' and that the locale's Unicode version
matches Perl's Unicode version):
-- snip --
set -o nounset

typeset -i16 i

for (( i=0 ; i <= 0x10FFFF ; i++ )) ; do
        ch="${ printf "\u[${i/~(El)16#/}]" ; }"

        if [[ "$ch" != ~(Elr)[[:alpha:][:alnum:][:digit:][:print:][:cntrl:][:space:][:blank:][:punct:]] ]] ; then
                printf "# match found: %q\n" "${i}"
        fi
done

print '# done.'
-- snip --

|iswrune()| or not... IMO it would be nice to have something like
\p{Unassigned} in normal egrep/xgrep regex, e.g. something like a
[:_unassigned:] character class...
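
Until such a class exists, one stopgap is to ask perl directly from the
shell. The following is a minimal sketch, not a proposal for the ast
tools: is_unassigned is a made-up helper name, it takes the codepoint as
hex digits without a prefix, and the answer depends entirely on the
Unicode tables shipped with the local perl, which may differ from the
locale's:

```shell
# Hypothetical helper: exit status 0 if the codepoint given as hex digits
# (e.g. 0378) matches perl's \p{Unassigned}, non-zero otherwise.
is_unassigned() {
    perl -e 'exit(chr(hex($ARGV[0])) =~ /\p{Unassigned}/ ? 0 : 1)' "$1"
}

# Usage: U+0041 is LATIN CAPITAL LETTER A, U+0378 is a reserved slot in
# the Greek block.
for cp in 0041 0378 ; do
    if is_unassigned "$cp" ; then
        echo "U+$cp: unassigned"
    else
        echo "U+$cp: assigned"
    fi
done
```

Spawning one perl process per codepoint is far too slow for a full sweep
of the codespace; for that, a single perl loop or a native
[:_unassigned:] class would be the better fit.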

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
