Re: [ast-developers] Fwd: Fwd: Where does FreeBSD tr -C differ from tr -c?

Glenn Fowler Wed, 05 Jun 2013 06:52:50 -0700

On Wed, 5 Jun 2013 12:51:18 +0200 Cedric Blancher wrote:
> On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
> >
> > I had posed a question to the posix austin group related to this
> > and failed to report back to ast-developers
> >
> > here is the relevant snippet, starting with a response from the group
> > and my comment
> >
> >>> Maybe what you're confusing is the concept of unassigned Unicode
> >>> codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
> >>> wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
> >>> far as C/POSIX is concerned, a multibyte sequence is legal if and only
> >>> if it corresponds to a wchar_t value via mbrtowc, and conversely, a
> >>> wchar_t value is a valid character if and only if it corresponds to a
> >>> multibyte character via wcrtomb. These operations should be inverses;
> >>> in particular they should be defined on each other's ranges.
> >>
> >> yes there is confusion started on some other threads which contained
> >> references to
> >>         int iswrune(wchar_t)
> >> which apparently tests for assigned codepoints
> >>
> >> what you just pointed out is exactly what is needed for the POSIX tr
> >> implementation -- basically that unassigned codepoints do not come into
> >> play
> >
> > basically the only tools an application has for:
> >         valid multibyte sequence is mbrtowc()
> >         valid wchar_t is wcrtomb()


> What about libast's optimized UTF-8 versions of mbrtowc() and
> wcrtomb()? They do not filter out unassigned code points, do they?
> Aside from that almost all mbrtowc() and wcrtomb() implementations for
> UTF-8 (and GBK/JIS too) are designed for speed and do NOT test whether
> a codepoint is currently assigned in Unicode or not. They delegate the
> problem to iswrune() if available or let the applications test whether
> the resulting wchar_t matches at least one isw<class>() or not.

> > iswrune() is a concept outside the scope of posix

> This is not correct. POSIX indirectly defines that a codepoint is only
> assigned if one or more of the POSIX isw<class>() functions returns a
> match. if none of the standard isw<class>() functions returns a match
> then the codepoint is not assigned. iswrune() is only a shortcut, as
> Roland's emulation code demonstrates.

nitpicking here:
since posix allows an implementation to define extension isw*() classes,
there is no portable way to define iswrune() from outside any given
implementation
by "outside the scope" I meant that, within the scope of posix and what it
demands for compliance, "invalid codepoint" is not mentioned
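
To make the disagreement concrete, here is a minimal sketch of the kind of
emulation being discussed. It is not Roland's code (which is not quoted in
this thread); the name emulated_iswrune() is hypothetical. It asks only the
twelve character classes that posix itself names, so any extension class an
implementation adds is invisible to it, which is the portability gap
described above.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: treat wc as "assigned" if at least one of the
 * standard posix character classes matches it in the current locale.
 * Implementation-defined extension classes are NOT consulted, so the
 * answer can differ from a native iswrune() where one exists. */
int emulated_iswrune(wint_t wc)
{
    static const char *const classes[] = {
        "alnum", "alpha", "blank", "cntrl", "digit", "graph",
        "lower", "print", "punct", "space", "upper", "xdigit"
    };
    size_t i;

    for (i = 0; i < sizeof classes / sizeof classes[0]; i++) {
        wctype_t t = wctype(classes[i]);

        /* wctype() returns 0 if the locale does not know the class */
        if (t != (wctype_t)0 && iswctype(wc, t))
            return 1;
    }
    return 0;
}
```

A caller would use it like emulated_iswrune(L'a'); the result depends on the
LC_CTYPE category of the current locale.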

the only place "codepoint" is mentioned is in the rationale for pax describing
why they chose UTF-8 as the internal archive format codeset encoding --
specifically because a pax archive used for interchange must be "codepoint"
agnostic and encode all characters
(rationales are not part of the standard proper)

> PS: iswrune() is not specific to Unicode. It is used in the GBK and
> JIS locales to distinguish GBK/JIS versions too.

the "point" is that posix commands need only report "invalid character encoding"
(EILSEQ) via mbrtowc() and wcrtomb() or equivalent; there is no requirement for
any posix command that it report "invalid codepoint"
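
As an illustration only, the two validity checks could be sketched as below.
The helper names valid_mb_char() and valid_wchar() are hypothetical and not
taken from libast or any posix command; the sketch assumes the caller has
already selected a locale with setlocale() and simply treats a conversion
failure as "invalid character encoding".

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: 1 if the n bytes at s are exactly one complete
 * multibyte character in the current locale, 0 otherwise. Incomplete
 * sequences ((size_t)-2) are treated as invalid here. */
int valid_mb_char(const char *s, size_t n)
{
    mbstate_t st;
    wchar_t wc;
    size_t r;

    assert(s != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    r = mbrtowc(&wc, s, n, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return 0;                       /* EILSEQ or truncated sequence */
    /* mbrtowc() returns 0 for the null character; accept a lone NUL byte */
    return r == n || (r == 0 && n == 1);
}

/* Hypothetical helper: 1 if wc converts to a multibyte character in the
 * current locale, 0 if wcrtomb() rejects it (errno is set to EILSEQ). */
int valid_wchar(wchar_t wc)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];

    memset(&st, 0, sizeof st);
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

Neither function asks whether the value is an assigned codepoint; whether an
unassigned value passes depends entirely on what the implementation's
mbrtowc() and wcrtomb() accept.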

it's nice that some implementations provide iswrune() to make it possible
to portably determine "invalid codepoint", but that has no bearing on
any posix compliant command implementation -- if any posix command
implementation were to fail on "invalid codepoint" it would be non-compliant

a command implementation could be extended via options to include "codepoint"
diagnostics, but it would be an extension

