On Wed, Jun 5, 2013 at 3:52 PM, Glenn Fowler <[email protected]> wrote:
> On Wed, 5 Jun 2013 12:51:18 +0200 Cedric Blancher wrote:
>> On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
>> > I had posed a question to the posix austin group related to this
>> > and failed to report back to ast-developers
[snip]
>> > iswrune() is a concept outside the scope of posix
>
>> This is not correct. POSIX indirectly defines that a codepoint is only
>> assigned if one or more of the POSIX isw<class>() functions returns a
>> match. if none of the standard isw<class>() functions returns a match
>> then the codepoint is not assigned. iswrune() is only a shortcut, as
>> Roland's emulation code demonstrates.
>
> nitpicking here
> since posix allows an implementation to define extension isw*() classes
> there is no portable way to define iswrune() from the outside of any
> implementation

Erm... yes and no... "yes" ... |isw*()| is extensible... but all
extensions so far (at least those I'm aware of on Solaris, AIX, HP/UX,
Linux and FreeBSD) are "extra" (usually to provide extra language- or
culture-specific help) and the same characters have matches in the
|isw*()|-classes defined by POSIX, too... which means that emulating
|iswrune()| the way I did is at least valid on these platforms (well...
FreeBSD, MacOS X and the OpenSolaris-derived Illumos define |iswrune()|
themselves...).

> by "outside the scope" I meant that, within the scope of posix and what it
> demands for compliance, "invalid codepoint" is not mentioned

Erm... at least for Unicode and GB18030 the issue is not "invalid
codepoint" ... it's "unassigned codepoint". The codepoint itself may be
valid but has no assigned meaning... which also makes it "unsortable" ...
which was AFAIK the FreeBSD rationale behind filtering unassigned
codepoints out (the other issue is that "sorting" Unicode characters via
|strxfrm()| is tricky in this case since unless the locale has defined a
specific "sort order" the characters are sorted using their numeric
codepoint value... which sorts even technically "unsortable" unassigned
code points. Grrr...).

> the only place "codepoint" is mentioned is in the rationale for pax describing
> why they chose UTF-8 as the internal archive format codeset encoding -
> specifically because a pax archive used for interchange must be "codepoint"
> agnostic and encode all characters
> (rationales are not part of the standard proper)
>
>> PS: iswrune() is not specific to Unicode. It is used in the GBK and
>> JIS locales to distinguish GBK/JIS versions too.
>
> the "point" is that posix commands need only report "invalid character
> encoding" (EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no
> requirement for any posix command that it report "invalid codepoint"

See above... s/invalid codepoint/unassigned codepoint/ ...
|EILSEQ| won't be returned unless the codepoint is beyond the numeric
limit for the matching Unicode standard...

> its nice that some implementations provide iswrune() to make it possible
> to portably determine "invalid codepoint", but that has no bearing on
> any posix compliant command implementation -- if any posix command
> implementation were to fail on "invalid codepoint" it would be non-compliant
>
> a command implementation could be extended via options to include "codepoint"
> diagnostics, but it would be an extension

Erm... AFAIK we don't need a "diagnostic" ... the wish here seems to be
to "filter out" (maybe using an extra "tr" option) any characters which
match neither |iswrune()| (if available) nor any of the |isw*()|
functions defined by POSIX (maybe we shouldn't name this class [:rune:]
in regex... maybe a better name is [:_posix_anychar:] ... leading '_'
because it is non-standard (for now) and "posix_anychar" to describe
that it should be true if the character matches any character class
defined by POSIX).

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
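
For illustration only, here is a minimal sketch of the emulation and the "filter out" idea discussed in this message. It is not the emulation code referred to in the thread, which is not shown here. The names |posix_anychar()| and |filter_unassigned()| and the build-time macro HAVE_ISWRUNE are assumptions made up for this example. The sketch relies on the premise argued above: on the listed platforms every assigned character matches at least one of the twelve POSIX |isw*()| classes.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, named after the [:_posix_anychar:] idea:
 * returns nonzero if wc matches at least one of the twelve character
 * classes that POSIX defines for the isw*() family.
 */
int posix_anychar(wint_t wc)
{
#ifdef HAVE_ISWRUNE
    /* HAVE_ISWRUNE is an assumed build-time macro, not a standard one. */
    return iswrune(wc);
#else
    return iswalnum(wc) || iswalpha(wc) || iswblank(wc) ||
           iswcntrl(wc) || iswdigit(wc) || iswgraph(wc) ||
           iswlower(wc) || iswprint(wc) || iswpunct(wc) ||
           iswspace(wc) || iswupper(wc) || iswxdigit(wc);
#endif
}

/*
 * Hypothetical filter: removes, in place, every character that
 * posix_anychar() rejects. Returns the new length, not counting
 * the terminating L'\0'.
 */
size_t filter_unassigned(wchar_t *s)
{
    size_t in, out = 0;

    for (in = 0; s[in] != L'\0'; in++) {
        if (posix_anychar((wint_t)s[in]))
            s[out++] = s[in];
    }
    s[out] = L'\0';
    return out;
}
```

The result depends on the LC_CTYPE category of the current locale, so a caller would normally run setlocale(LC_ALL, "") first. As noted in the quoted objection, an implementation is free to add extension classes, so this is only as portable as the premise above.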
