On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
>
> I had posed a question to the posix austin group related to this
> and failed to report back to ast-developers
>
> here is the relevant snippet, starting with a response from the group
> and my comment
>
>>> Maybe what you're confusing is the concept of unassigned Unicode
>>> codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
>>> wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
>>> far as C/POSIX is concerned, a multibyte sequence is legal if and only
>>> if it corresponds to a wchar_t value via mbrtowc, and conversely, a
>>> wchar_t value is a valid character if and only if it corresponds to a
>>> multibyte character via wcrtomb. These operations should be inverses;
>>> in particular they should be defined on each other's ranges.
>>
>> yes there is confusion started on some other threads which contained
>> references to
>>     int iswrune(wchar_t)
>> which apparently tests for assigned codepoints
>>
>> what you just pointed out it is exactly what is needed for the POSIX tr
>> implementation -- basically that unassigned codepoints do not come into play
>
> basically the only tools an application has for:
>     valid multibyte sequence is mbrtowc()
>     valid wchar_t is wcrtomb()
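For reference, the two validity checks named in the quoted text can be sketched in C roughly as follows. This is only a minimal sketch under the current locale; the helper names is_valid_mbseq() and is_valid_wchar() are made up for illustration and are not part of libast or POSIX.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (name is an assumption): returns 1 if the first
 * n bytes of s start with a complete multibyte sequence that mbrtowc()
 * accepts in the current locale, 0 otherwise. */
static int is_valid_mbseq(const char *s, size_t n)
{
    mbstate_t st;
    wchar_t wc;
    size_t r;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    r = mbrtowc(&wc, s, n, &st);
    /* (size_t)-1: illegal sequence; (size_t)-2: incomplete sequence */
    return r != (size_t)-1 && r != (size_t)-2;
}

/* Hypothetical helper (name is an assumption): returns 1 if wcrtomb()
 * can convert wc to a multibyte character in the current locale. */
static int is_valid_wchar(wchar_t wc)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];

    memset(&st, 0, sizeof st);
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

Neither helper looks at whether the codepoint is assigned; each only reports whether the conversion function accepts its input.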
What about libast's optimized UTF-8 versions of mbrtowc() and
wcrtomb()? They do not filter out unassigned code points, do they?

Aside from that, almost all mbrtowc() and wcrtomb() implementations for
UTF-8 (and GBK/JIS too) are designed for speed and do NOT test whether a
codepoint is currently assigned in Unicode. They delegate that problem
to iswrune(), if available, or leave it to the application to test
whether the resulting wchar_t matches at least one isw<class>()
function.

> iswrune() is a concept outside the scope of posix

This is not correct. POSIX indirectly defines a codepoint as assigned
only if one or more of the POSIX isw<class>() functions returns a
match. If none of the standard isw<class>() functions returns a match,
then the codepoint is not assigned. iswrune() is only a shortcut, as
Roland's emulation code demonstrates.

PS: iswrune() is not specific to Unicode. It is used in the GBK and JIS
locales to distinguish GBK/JIS versions too.

Ced

-- 
Cedric Blancher <[email protected]>
Institute Pasteur
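The idea that iswrune() is only a shortcut over the isw<class>() functions can be sketched as below. This is an illustrative sketch, not Roland's emulation code and not the BSD or libast implementation; the name emu_iswrune() is hypothetical, and it only consults the twelve standard classes, not any extra classes a locale may define via wctype().

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical emulation (name is an assumption): treat wc as
 * "assigned" if at least one standard isw<class>() function matches
 * it in the current locale. Returns 1 on a match, 0 otherwise. */
static int emu_iswrune(wint_t wc)
{
    return iswalnum(wc)  || iswalpha(wc) || iswblank(wc) ||
           iswcntrl(wc)  || iswdigit(wc) || iswgraph(wc) ||
           iswlower(wc)  || iswprint(wc) || iswpunct(wc) ||
           iswspace(wc)  || iswupper(wc) || iswxdigit(wc);
}
```

The result depends on the LC_CTYPE category of the current locale, so the same wchar_t value may be reported differently under different locales.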
