I knew I would get into semantic trouble here

I'm not complaining about or deriding the efficacy of iswrune(),
only pointing out that it has no bearing on any posix compliant utility

if anyone wants to start a discussion about new utility option(s) that rely
on iswrune() and what ast utilities should be affected, great

for systems that do not supply iswrune() portability remains a big issue,
current practice notwithstanding -- it will always be an iffe|config game
of catchup vs. the isw*() collection du jour

On Sat, 8 Jun 2013 23:17:08 +0200 Roland Mainz wrote:
> On Wed, Jun 5, 2013 at 3:52 PM, Glenn Fowler <[email protected]> wrote:
> > On Wed, 5 Jun 2013 12:51:18 +0200 Cedric Blancher wrote:
> >> On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
> >> > I had posed a question to the posix austin group related to this
> >> > and failed to report back to ast-developers
> [snip]
> >> > iswrune() is a concept outside the scope of posix
> >
> >> This is not correct. POSIX indirectly defines that a codepoint is only
> >> assigned if one or more of the POSIX isw<class>() functions returns a
> >> match. if none of the standard isw<class>() functions returns a match
> >> then the codepoint is not assigned. iswrune() is only a shortcut, as
> >> Roland's emulation code demonstrates.
> >
> > nitpicking here
> > since posix allows an implementation to define extension isw*() classes
> > there is no portable way to define iswrune() from the outside of any
> > implementation

> Erm... yes and no... "yes" ... |isw*()| is extensible... but all
> extensions so far (at least those I'm aware of on Solaris, AIX, HP/UX,
> Linux and FreeBSD) are "extra" (usually to provide extra language- or
> culture-specific help) and the same characters have matches in the
> |isw*()|-classes defined by POSIX, too... which means that emulating
> |iswrune()| the way I did is at least valid on these platforms
> (well... FreeBSD, MacOS X and the OpenSolaris-derived Illumos define
> |iswrune()| themselves...).

> > by "outside the scope" I meant that, within the scope of posix and what it
> > demands for compliance, "invalid codepoint" is not mentioned

> Erm... at least for Unicode and GB18030 the issue is not "invalid
> codepoint" ... it's "unassigned codepoint". The codepoint itself may
> be valid but has no assigned meaning... which also makes it
> "unsortable" ... which was AFAIK the FreeBSD rationale behind
> filtering unassigned codepoints out (the other issue is that "sorting"
> Unicode characters via |strxfrm()| is tricky in this case since unless
> the locale has defined a specific "sort order" the characters are
> sorted using their numeric codepoint value... which sorts even
> technically "unsortable" unassigned code points. Grrr...).

> > the only place "codepoint" is mentioned is in the rationale for pax
> > describing why they chose UTF-8 as the internal archive format codeset
> > encoding - specifically because a pax archive used for interchange must
> > be "codepoint" agnostic and encode all characters
> > (rationales are not part of the standard proper)
> >
> >> PS: iswrune() is not specific to Unicode. It is used in the GBK and
> >> JIS locales to distinguish GBK/JIS versions too.
> >
> > the "point" is that posix commands need only report "invalid character
> > encoding" (EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no
> > requirement for any posix command that it report "invalid codepoint"

> See above... s/invalid codepoint/unassigned codepoint/ ... |EILSEQ|
> won't be returned unless the codepoint is beyond the numeric limit for
> the matching Unicode standard...

> > it's nice that some implementations provide iswrune() to make it possible
> > to portably determine "invalid codepoint", but that has no bearing on
> > any posix compliant command implementation -- if any posix command
> > implementation were to fail on "invalid codepoint" it would be
> > non-compliant
> >
> > a command implementation could be extended via options to include
> > "codepoint" diagnostics, but it would be an extension

> Erm... AFAIK we don't need a "diagnostic" ... AFAIK the wish here
> seems to be to "filter out" (maybe using an extra "tr" option) any
> characters which do not match either |iswrune()| (if available) or all
> of the |isw*()| functions defined by POSIX (maybe we shouldn't name
> this class [:rune:] in regex... maybe a better name is
> [:_posix_anychar:] ... leading '_' because it is non-standard (for
> now) and "posix_anychar" to describe it should be true if it matches
> any character class defined by POSIX).

> ----

> Bye,
> Roland

> --
>   __ .  . __
>  (o.\ \/ /.o) [email protected]
>   \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
>   /O /==\ O\  TEL +49 641 3992797
>  (;O/ \/ \O;)

_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
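
For readers following the thread, the emulation idea being debated (treat a codepoint as assigned when at least one of the POSIX-defined isw<class>() functions matches it) could be sketched roughly as below. This is not Roland's emulation code, which does not appear in this message; the function name posix_anychar is hypothetical and only borrows the class name floated above. As noted in the thread, an implementation may define extension classes, so this is at best an approximation of a native iswrune(), and the result depends on the current LC_CTYPE locale.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (the name is an assumption, not a POSIX or ast API).
 * Returns nonzero if wc matches at least one of the twelve character
 * classes that POSIX defines isw*() functions for, in the current
 * LC_CTYPE locale. Implementation-specific extension classes are not
 * consulted, which is exactly the portability caveat raised in the thread.
 */
int posix_anychar(wint_t wc)
{
    return iswalnum(wc) || iswalpha(wc) || iswblank(wc) || iswcntrl(wc)
        || iswdigit(wc) || iswgraph(wc) || iswlower(wc) || iswprint(wc)
        || iswpunct(wc) || iswspace(wc) || iswupper(wc) || iswxdigit(wc);
}
```

A filter of the kind suggested for "tr" would drop every wide character for which such a function returns zero, and on platforms that supply iswrune() natively the configuration step would presumably select that instead.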
