On 5 June 2013 07:03, Glenn Fowler <[email protected]> wrote:
>
> I had posed a question to the posix austin group related to this
> and failed to report back to ast-developers
>
> here is the relevant snippet, starting with a response from the group
> and my comment
>
>>> Maybe what you're confusing is the concept of unassigned Unicode
>>> codepoints (a Unicode concept irrelevant to C/POSIX) and invalid
>>> wchar_t values or illegal multibyte sequences (a C/POSIX concept). As
>>> far as C/POSIX is concerned, a multibyte sequence is legal if and only
>>> if it corresponds to a wchar_t value via mbrtowc, and conversely, a
>>> wchar_t value is a valid character if and only if it corresponds to a
>>> multibyte character via wcrtomb. These operations should be inverses;
>>> in particular they should be defined on each other's ranges.
>>
>> yes there is confusion started on some other threads which contained
>> references to
>>         int iswrune(wchar_t)
>> which apparently tests for assigned codepoints
>>
>> what you just pointed out it is exactly what is needed for the POSIX tr
>> implementation -- basically that unassigned codepoints do not come into play
>
> basically the only tools an application has for:
>         valid multibyte sequence is mbrtowc()
>         valid wchar_t is wcrtomb()

What about libast's optimized UTF-8 versions of mbrtowc() and
wcrtomb()? They do not filter out unassigned code points, do they?
Aside from that, almost all mbrtowc() and wcrtomb() implementations for
UTF-8 (and GBK/JIS too) are designed for speed and do NOT test whether
a codepoint is currently assigned in Unicode. They delegate that
problem to iswrune(), if available, or leave it to the application to
test whether the resulting wchar_t matches at least one isw<class>().
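
To make the distinction concrete, here is a minimal sketch of the two
validity checks Glenn lists, using only standard mbrtowc() and wcrtomb().
The helper names is_valid_mb() and is_valid_wc() are made up for this
mail; they are not libast functions. Neither check says anything about
whether the codepoint is assigned.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the first bytes of s[0..n-1] form a
 * complete multibyte character that is valid in the current locale. */
static int is_valid_mb(const char *s, size_t n)
{
    mbstate_t st;
    wchar_t wc;
    size_t r;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    r = mbrtowc(&wc, s, n, &st);
    /* (size_t)-1: illegal sequence, (size_t)-2: incomplete sequence */
    return r != (size_t)-1 && r != (size_t)-2;
}

/* Hypothetical helper: returns 1 if wc can be converted to a multibyte
 * character in the current locale. */
static int is_valid_wc(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

Both results depend on the locale in effect (LC_CTYPE), so a caller would
normally call setlocale() first.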

> iswrune() is a concept outside the scope of posix

This is not correct. POSIX indirectly defines a codepoint as assigned
only if one or more of the POSIX isw<class>() functions returns a
match. If none of the standard isw<class>() functions returns a match,
then the codepoint is not assigned. iswrune() is only a shortcut, as
Roland's emulation code demonstrates.
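
For illustration, such a shortcut could look roughly like the sketch
below. This is my own minimal version, not Roland's actual code, and
the name iswrune_sketch() is made up. It assumes that "assigned" means
"matches at least one of the twelve standard <wctype.h> classes".

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical emulation: returns 1 if wc belongs to at least one
 * standard character class in the current locale, 0 otherwise. */
static int iswrune_sketch(wint_t wc)
{
    return iswalnum(wc) || iswalpha(wc) || iswblank(wc) ||
           iswcntrl(wc) || iswdigit(wc) || iswgraph(wc) ||
           iswlower(wc) || iswprint(wc) || iswpunct(wc) ||
           iswspace(wc) || iswupper(wc) || iswxdigit(wc);
}
```

Several of these tests overlap (iswalnum() covers iswalpha() and
iswdigit(), for example), so a real implementation could use fewer
calls; the full list is kept here to mirror the rule as stated.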

PS: iswrune() is not specific to Unicode. It is also used in the GBK
and JIS locales, to distinguish between GBK/JIS versions.

Ced
-- 
Cedric Blancher <[email protected]>
Institute Pasteur
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
