Re: [ast-developers] Fwd: Fwd: Where does FreeBSD tr -C differ from tr -c?

Glenn Fowler Sat, 08 Jun 2013 19:45:26 -0700

I knew I would get into semantic trouble here
I'm not complaining/deriding the efficacy of iswrune()
only that it has no bearing on any posix compliant utility


if anyone wants to start a discussion about new utility option(s)
that rely on iswrune() and what ast utilities should be affected, great

for systems that do not supply iswrune() portability remains a big issue,
current practice notwithstanding -- it will always be an
iffe|config game of catchup vs. the isw*() collection du jour

On Sat, 8 Jun 2013 23:17:08 +0200 Roland Mainz wrote:
> On Wed, Jun 5, 2013 at 3:52 PM, Glenn Fowler <[email protected]> wrote:
> > On Wed, 5 Jun 2013 12:51:18 +0200 Cedric Blancher wrote:
> >> On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
> >> > I had posed a question to the posix austin group related to this
> >> > and failed to report back to ast-developers
> [snip]
> >> > iswrune() is a concept outside the scope of posix
> >
> >> This is not correct. POSIX indirectly defines that a codepoint is only
> >> assigned if one or more of the POSIX isw<class>() functions returns a
> >> match. if none of the standard isw<class>() functions returns a match
> >> then the codepoint is not assigned. iswrune() is only a shortcut, as
> >> Roland's emulation code demonstrates.
> >
> > nitpicking here
> > since posix allows an implementation to define extension isw*() classes
> > there is no portable way to define iswrune() from the outside of any implementation

> Erm... yes and no... "yes" ... |isw*()| is extensible... but all
> extensions so far (at least those I'm aware of on Solaris, AIX, HP/UX,
> Linux and FreeBSD) are "extra" (usually to provide extra language- or
> culture-specific help) and the same characters have matches in the
> |isw*()|-classes defined by POSIX, too... which means that emulating
> |iswrune()| the way I did is at least valid on these platforms
> (well... FreeBSD, MacOS X and the OpenSolaris-derived Illumos define
> |iswrune()| themselves...).
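
For illustration only, a minimal sketch of the kind of emulation discussed above: a codepoint counts as "assigned" if at least one of the isw<class>() functions defined by POSIX matches it in the current locale. The name posix_anychar() is hypothetical, and the emulation code actually posted in this thread is not reproduced here and may well differ. As noted above, this only approximates iswrune() on platforms whose extension classes are subsets of the POSIX ones.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not taken from the thread): returns nonzero if wc
 * belongs to at least one of the twelve character classes that POSIX
 * defines, evaluated in the current LC_CTYPE locale.
 */
int posix_anychar(wint_t wc)
{
    return iswalnum(wc) || iswalpha(wc) || iswblank(wc) || iswcntrl(wc) ||
           iswdigit(wc) || iswgraph(wc) || iswlower(wc) || iswprint(wc) ||
           iswpunct(wc) || iswspace(wc) || iswupper(wc) || iswxdigit(wc);
}
```

The result depends on the locale selected with setlocale(); in the default "C" locale only a small set of characters is classified at all.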

> > by "outside the scope" I meant that, within the scope of posix and what it
> > demands for compliance, "invalid codepoint" is not mentioned

> Erm... at least for Unicode and GB18030 the issue is not "invalid
> codepoint" ... it's "unassigned codepoint". The codepoint itself may
> be valid but has no assigned meaning... which also makes it
> "unsortable" ... which was AFAIK the FreeBSD rationale behind
> filtering unassigned codepoints out (the other issue is that "sorting"
> Unicode characters via |strxfrm()| is tricky in this case since unless
> the locale has defined a specific "sort order" the characters are
> sorted using their numeric codepoint value... which sorts even
> technically "unsortable" unassigned code points. Grrr...).

> > the only place "codepoint" is mentioned is in the rationale for pax describing
> > why they chose UTF-8 as the internal archive format codeset encoding - specifically
> > because a pax archive used for interchange must be "codepoint" agnostic and
> > encode all characters
> > (rationales are not part of the standard proper)
> >
> >> PS: iswrune() is not specific to Unicode. It is used in the GBK and
> >> JIS locales to distinguish GBK/JIS versions too.
> >
> > the "point" is that posix commands need only report "invalid character encoding"
> > (EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no requirement for
> > any posix command that it report "invalid codepoint"

> See above... s/invalid codepoint/unassigned codepoint/ ... |EILSEQ|
> won't be returned unless the codepoint is beyond the numeric limit for
> the matching Unicode standard...

> > it's nice that some implementations provide iswrune() to make it possible
> > to portably determine "invalid codepoint", but that has no bearing on
> > any posix compliant command implementation -- if any posix command implementation
> > were to fail on "invalid codepoint" it would be non-compliant
> >
> > a command implementation could be extended via options to include "codepoint"
> > diagnostics, but it would be an extension

> Erm... AFAIK we don't need a "diagnostic" ... AFAIK the wish here
> seems to be to "filter out" (maybe using an extra "tr" option) any
> characters which do not match either |iswrune()| (if available) or all
> of the |isw*()| functions defined by POSIX (maybe we shouldn't name
> this class [:rune:] in regex... maybe a better name is
> [:_posix_anychar:] ... leading '_' because it is non-standard (for
> now) and "posix_anychar" to describe it should be true if it matches
> any character class defined by POSIX).
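
As a rough sketch of the "filter out" idea (not an actual tr option or any existing ast code), the helper below copies a wide string and keeps only the characters accepted by a caller-supplied predicate. The name filter_wcs() and its signature are assumptions made for this example. The predicate could be iswrune() on platforms that declare it, or an emulation built from the POSIX isw*() functions.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: copy src to dst, keeping only the wide characters
 * for which keep() returns nonzero. dst must have room for
 * wcslen(src) + 1 elements. Returns the number of characters kept.
 */
size_t filter_wcs(wchar_t *dst, const wchar_t *src, int (*keep)(wint_t))
{
    size_t n = 0;

    for (; *src != L'\0'; src++) {
        if (keep((wint_t)*src))
            dst[n++] = *src;    /* character matches the class: keep it */
    }
    dst[n] = L'\0';
    return n;
}
```

For example, filter_wcs(out, L"a1b2", iswdigit) leaves L"12" in out. Taking the predicate as a parameter keeps the filter independent of whether the platform supplies iswrune().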

> ----

> Bye,
> Roland

> -- 
>   __ .  . __
>  (o.\ \/ /.o) [email protected]
>   \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
>   /O /==\ O\  TEL +49 641 3992797
>  (;O/ \/ \O;)

_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
