Edmund GRIMLEY EVANS wrote on 2000-07-25 10:03 UTC:
> So what should mbtowc(&wc, "\xED\xB2\x80", 3) return?

If you follow my option D), then it should set wc = 0xdced and return 1,
because it has decoded one byte.
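
To make that behaviour concrete, here is a minimal sketch of a decoder in the option D style. The name utf8b_decode() and the uint32_t output type are invented for this illustration and are not part of any libc. It also assumes the 4-byte / U+10FFFF limit of current UTF-8 definitions, and it treats a truncated sequence like any other malformed byte; both are assumptions, not something the proposal spells out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option D (UTF-8B) style decoder: never reports an error.
 * Returns the number of bytes consumed (0 for NUL or n == 0).
 * A byte that is not part of a valid UTF-8 sequence is returned
 * as 0xDC00 + byte and consumes exactly one byte. */
int utf8b_decode(uint32_t *out, const unsigned char *s, size_t n)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0) { *out = 0; return 0; }

    if (s[0] < 0x80) {              /* ASCII, including NUL */
        *out = s[0];
        return s[0] != 0;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        goto malformed;             /* stray continuation or invalid lead byte */
    }

    if (len > n)
        goto malformed;             /* truncated sequence (assumption, see above) */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto malformed;         /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        goto malformed;             /* overlong, out of range, or surrogate */

    *out = cp;
    return (int)len;

malformed:
    *out = 0xDC00u + s[0];          /* escape the offending byte */
    return 1;
}
```

For the input \xED\xB2\x80 the three bytes would decode to the surrogate U+DC80, so the sketch rejects the sequence, escapes the first byte as 0xDCED and consumes one byte. The caller then continues with \xB2, which is escaped in the same way.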

According to ISO 10646-1, \xED\xB2\x80 is not valid UTF-8 anyway: decoded
naively, it would yield the surrogate U+DC80.

Note that my suggested semantics include the traditional ones without
loss of information. You can always write a simple wrapper around an
option D (UTF-8B) mbtowc() implementation, of the form

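/* utf8b_mbtowc() stands for the underlying option D (UTF-8B) decoder;
   the name is only a placeholder, so the wrapper does not call itself. */
int mbtowc(wchar_t * restrict pwc,
           const char * restrict s,
           size_t n)
{
  int r;
  wchar_t w;

  r = utf8b_mbtowc(&w, s, n);
  if (s && r >= 0) {
    /* 0xDC80..0xDCFF marks a byte that was not part of valid UTF-8;
       a range test also works when wchar_t is wider than 16 bits */
    if (w >= 0xdc80 && w <= 0xdcff)
      return -1;
    if (pwc)
      *pwc = w;
  }

  return r;
}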

if you want mbtowc() to report malformed UTF-8 sequences the traditional
way, by returning -1 (while preserving all the other traditional mbtowc()
semantics).
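
The check in the wrapper above works because every escaped byte lands in 0xDC80..0xDCFF and keeps the original byte in its low 8 bits (0xED became 0xDCED). As a rough sketch, with helper names invented here, a caller could test for an escaped byte and recover it like this:

```c
#include <stdint.h>

/* Hypothetical helpers, not part of any standard API. */

/* True if w is an escaped malformed byte (0xDC80..0xDCFF). */
static int is_escaped_byte(uint32_t w)
{
    return w >= 0xDC80 && w <= 0xDCFF;
}

/* The original byte behind an escaped value, e.g. 0xDCED -> 0xED. */
static unsigned char escaped_byte(uint32_t w)
{
    return (unsigned char)(w & 0xFF);
}
```

This is what "without loss of information" amounts to here: code that wants the traditional error can detect the escape, and code that does not care can carry the value along and still get the original byte back later.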

> I really don't like the idea of a UTF-8 decoder having to know about
> surrogates which have nothing to do with UTF-8.

The UTF-8 definition in ISO 10646-1 talks in Note 3 of appendix R
already about surrogates anyway, people just prefer not to read the fine
print ...

http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

> If that sort of thing
> starts being imposed, I start to wonder whether Unicode really is too
> complex to be secure ...

All I want to do is to carefully engineer the API, such that dangerous
complexities are hidden in a clever way inside mbtowc() and friends, so
as to make it less likely that users of these functions will do
dangerous things accidentally.

A mbtowc() that never returns an error is certainly far easier (which
often means also safer) to use than one that requires the programmer to
go through various error-prone exception-handling headaches.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
