Re: [ast-developers] Fwd: Fwd: Where does FreeBSD tr -C differ from tr -c?

Roland Mainz Sat, 08 Jun 2013 14:18:01 -0700

On Wed, Jun 5, 2013 at 3:52 PM, Glenn Fowler <[email protected]> wrote:
> On Wed, 5 Jun 2013 12:51:18 +0200 Cedric Blancher wrote:
>> On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
>> > I had posed a question to the posix austin group related to this
>> > and failed to report back to ast-developers
[snip]
>> > iswrune() is a concept outside the scope of posix
>
>> This is not correct. POSIX indirectly defines that a codepoint is only
>> assigned if one or more of the POSIX isw<class>() functions returns a
>> match. if none of the standard isw<class>() functions returns a match
>> then the codepoint is not assigned. iswrune() is only a shortcut, as
>> Roland's emulation code demonstrates.
>
> nitpicking here
> since posix allows an implementation to define extension isw*() classes
> there is no portable way to define iswrune() from the outside of any 
> implementation


Erm... yes and no... "yes" ... |isw*()| is extensible... but all
extensions so far (at least those I'm aware of on Solaris, AIX, HP/UX,
Linux and FreeBSD) are "extra" (usually to provide extra language- or
culture-specific help) and the same characters have matches in the
|isw*()|-classes defined by POSIX, too... which means that emulating
|iswrune()| the way I did is at least valid on these platforms
(well... FreeBSD, MacOS X and the OpenSolaris-derived Illumos define
|iswrune()| themselves...).
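
A minimal sketch of the kind of emulation discussed above, assuming that
a codepoint counts as "assigned" when at least one of the twelve
POSIX-defined |isw*()| functions matches it. The name |emu_iswrune()| is
made up for this illustration and this is not necessarily the exact
emulation code referred to in this thread:

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (the name is an assumption, not a POSIX or AST API):
 * returns 1 if any POSIX-defined isw*() class matches wc in the current
 * locale, 0 otherwise.
 */
static int emu_iswrune(wint_t wc)
{
    return iswalnum(wc) || iswalpha(wc) || iswblank(wc) || iswcntrl(wc) ||
           iswdigit(wc) || iswgraph(wc) || iswlower(wc) || iswprint(wc) ||
           iswpunct(wc) || iswspace(wc) || iswupper(wc) || iswxdigit(wc);
}
```

The result depends on the locale's LC_CTYPE data, so it is only as good
as the character class tables the platform ships.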

> by "outside the scope" I meant that, within the scope of posix and what it
> demands for compliance, "invalid codepoint" is not mentioned

Erm... at least for Unicode and GB18030 the issue is not "invalid
codepoint" ... it's "unassigned codepoint". The codepoint itself may
be valid but has no assigned meaning... which also makes it
"unsortable" ... which was AFAIK the FreeBSD rationale behind
filtering unassigned codepoints out (the other issue is that "sorting"
Unicode characters via |strxfrm()| is tricky in this case since unless
the locale has defined a specific "sort order" the characters are
sorted using their numeric codepoint value... which sorts even
technically "unsortable" unassigned code points. Grrr...).

> the only place "codepoint" is mentioned is in the rationale for pax describing
> why they chose UTF-8 as the internal archive format codeset encoding -
> specifically because a pax archive used for interchange must be "codepoint"
> agnostic and encode all characters
> (rationales are not part of the standard proper)
>
>> PS: iswrune() is not specific to Unicode. It is used in the GBK and
>> JIS locales to distinguish GBK/JIS versions too.
>
> the "point" is that posix commands need only report "invalid character
> encoding" (EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no
> requirement for any posix command that it report "invalid codepoint"

See above... s/invalid codepoint/unassigned codepoint/ ... |EILSEQ|
won't be returned unless the codepoint is beyond the numeric limit for
the matching Unicode standard...
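
To illustrate where |EILSEQ| comes from, here is a rough sketch using
|mbrtowc()|. The helper name |decode_first()| and its return convention
are assumptions made for this example only:

```c
#include <errno.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode the first multibyte character of s (at most
 * n bytes) in the current locale.
 * Returns 0 on success (character stored in *out), EILSEQ if mbrtowc()
 * rejects the byte sequence, or -1 if the sequence is incomplete.
 */
static int decode_first(const char *s, size_t n, wchar_t *out)
{
    mbstate_t st;
    size_t r;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    r = mbrtowc(out, s, n, &st);
    if (r == (size_t)-1)
        return EILSEQ;              /* invalid character encoding */
    if (r == (size_t)-2)
        return -1;                  /* incomplete multibyte sequence */
    return 0;
}
```

In a UTF-8 locale a stray continuation byte such as "\x80" would be
expected to yield |EILSEQ|, while a well-formed sequence for an
unassigned codepoint would be expected to decode without error, which is
why a separate |iswrune()|-style check is needed for the latter.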

> its nice that some implementations provide iswrune() to make it possible
> to portably determine "invalid codepoint", but that has no bearing on
> any posix compliant command implementation -- if any posix command
> implementation were to fail on "invalid codepoint" it would be non-compliant
>
> a command implementation could be extended via options to include "codepoint"
> diagnostics, but it would be an extension

Erm... AFAIK we don't need a "diagnostic" ... AFAIK the wish here
seems to be to "filter out" (maybe using an extra "tr" option) any
characters which match neither |iswrune()| (if available) nor any
of the |isw*()| functions defined by POSIX (maybe we shouldn't name
this class [:rune:] in regex... maybe a better name is
[:_posix_anychar:] ... leading '_' because it is non-standard (for
now) and "posix_anychar" to describe that it should be true if it matches
any character class defined by POSIX).
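
A rough sketch of how such a filter could look, assuming the class is
computed via |wctype()|/|iswctype()| over the twelve class names POSIX
defines. The names |posix_anychar()| and |filter_wcs()| are hypothetical
and not part of any existing "tr" implementation:

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* The twelve character class names POSIX requires wctype() to accept. */
static const char *const posix_classes[] = {
    "alnum", "alpha", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "xdigit"
};

/* Hypothetical predicate for the proposed [:_posix_anychar:] class:
 * returns 1 if wc is a member of any POSIX-defined class, else 0. */
static int posix_anychar(wint_t wc)
{
    size_t i;

    for (i = 0; i < sizeof(posix_classes) / sizeof(posix_classes[0]); i++) {
        wctype_t desc = wctype(posix_classes[i]);
        if (desc != (wctype_t)0 && iswctype(wc, desc))
            return 1;
    }
    return 0;
}

/* Hypothetical filter: remove, in place, every character of s for which
 * keep() returns 0. Returns the length of the filtered string. */
static size_t filter_wcs(wchar_t *s, int (*keep)(wint_t))
{
    wchar_t *in = s;
    wchar_t *out = s;

    for (; *in != L'\0'; in++)
        if (keep((wint_t)*in))
            *out++ = *in;
    *out = L'\0';
    return (size_t)(out - s);
}
```

Calling |filter_wcs(buf, posix_anychar)| would then drop everything that
matches no POSIX class in the current locale. Passing the predicate as a
function pointer means a native |iswrune()| could be swapped in on
platforms which provide one.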

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)