Re: [discuss] WCHAR_T <=> UTF-8 conversion

Alexander Pyhalov via illumos-discuss Fri, 15 Aug 2014 08:07:29 -0700

On 08/15/2014 18:44, Garrett D'Amore wrote:

I don't know why the iconv module is dumping core, but recognize you can't
just wchar_t "Hello"


The wide characters have to be initialized properly; it's not generally
possible to do this from constant values directly.  Instead you have to
convert to the wide characters first, from another format.  I recommend, if
you are running in a UTF-8 locale (such as ru_RU.UTF-8), that you use
mbstowcs() to convert a UTF-8 string to wchar_t's.  You should then be able
to convert from UCS-4 to UTF-8.  (Note carefully though, the fact that the
wchar_t's are UCS-4 is *not an interface*.  The encoding of wchar_t's is a
platform implementation detail.)

Here's an example:

   ...
   wchar_t wcs[32];
   char utf8[32];
   size_t inlen, outlen;
   iconv_t hdl;

   setlocale(LC_ALL, "ru_RU.UTF-8");

   mbstowcs(wcs, "спасибо болшой", sizeof (wcs) / sizeof (wcs[0]));
   // wcs now contains UCS-4 version of Russian thank you very much
   ...
   inlen = wcslen(wcs) * sizeof (wchar_t);
   outlen = sizeof (utf8);

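   hdl = iconv_open("UTF-8", "UCS-4");
   // iconv() takes pointers to buffer pointers and advances them,
   // so pass the addresses of separate cursor variables
   char *inp = (char *)wcs, *outp = utf8;
   iconv(hdl, &inp, &inlen, &outp, &outlen);
   // utf8 now contains "спасибо болшой"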


Let's try...

  char out[1024];
  iconv_t cd;
  int ret;
  wchar_t in[1024];
  size_t inlen;

  size_t outsz=sizeof(out);

  setlocale(LC_ALL,"ru_RU.UTF-8");

  mbstowcs(in,"Привет!",sizeof (in) / sizeof (in[0]));
  inlen=wcslen(in) * sizeof (wchar_t);
  cd = iconv_open("UTF-8","UCS-4");
  if (cd == (iconv_t)-1) {
      (void) fprintf(stderr, "iconv_open failed\n");
      return (1);
  }
  iconv(cd,&in,&inlen,&out,&outsz);

$ ./test_utf8_mbchar
Segmentation Fault (core dumped)

Note that the above is most definitely *not* the recommended way to get to
UCS-4.  The only formally correct way to get to UCS-4 from UTF-8 is to use
iconv() to convert from UTF-8.  The only APIs that you should formally be
sending wchar_t's to are the wide character routines (e.g. wcslen()).
Passing wchar_t's directly to iconv as I've done above is technically
incorrect, although I believe in the case above it will work.

Note that this will not work in the "C" locale.
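
For comparison, the "formally correct" route mentioned above (iconv() straight
from UTF-8, with no wchar_t involved) might look roughly like the sketch below.
The helper name utf8_to_utf32be is hypothetical, and the codeset name
"UTF-32BE" is an assumption chosen because its byte order is explicit; the
names an iconv implementation accepts vary by platform. The sketch also shows
that iconv() expects the addresses of pointer variables, not of the buffers
themselves.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-32BE.
 * Returns the number of bytes written to out, or (size_t)-1 on failure.
 * Assumption: the local iconv accepts the codeset name "UTF-32BE".
 */
size_t
utf8_to_utf32be(const char *utf8, char *out, size_t outsz)
{
	iconv_t cd;
	/*
	 * iconv() advances these cursors as it consumes input and produces
	 * output.  Some platforms declare the input argument as
	 * const char **; this sketch follows the char ** prototype.
	 */
	char *inp = (char *)utf8;
	char *outp = out;
	size_t inleft = strlen(utf8);
	size_t outleft = outsz;
	size_t rc;

	cd = iconv_open("UTF-32BE", "UTF-8");
	if (cd == (iconv_t)-1)
		return ((size_t)-1);

	rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	(void) iconv_close(cd);

	if (rc == (size_t)-1)
		return ((size_t)-1);	/* invalid input or out too small */
	return (outsz - outleft);
}
```

A caller would pass a plain char buffer of at least four bytes per expected
character and check the return value before using the result.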


And what is the recommended way of converting wchar_t * to UTF-8 char *?
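
One candidate, offered only as a sketch and not as an answer from the thread:
if the process is already running in a UTF-8 locale, wcstombs() is the
standard inverse of mbstowcs() and never exposes the wchar_t encoding. The
helper names wcs_to_mbs and demo_roundtrip are hypothetical, and the list of
locale names tried is an assumption about what is installed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a wide string to the multibyte encoding of
 * the current LC_CTYPE locale (UTF-8 when a UTF-8 locale is active).
 * Returns 0 on success, -1 on an unconvertible character or short buffer.
 */
int
wcs_to_mbs(const wchar_t *wcs, char *dst, size_t dstsz)
{
	size_t n = wcstombs(dst, wcs, dstsz);

	/* n == dstsz means the output was cut off before the closing NUL */
	if (n == (size_t)-1 || n >= dstsz)
		return (-1);
	return (0);
}

/*
 * Hypothetical demo: UTF-8 -> wchar_t -> UTF-8 using only locale-aware
 * routines.  Returns -1 if none of the assumed UTF-8 locales exists.
 */
int
demo_roundtrip(const char *utf8, char *dst, size_t dstsz)
{
	static const char *names[] = { "ru_RU.UTF-8", "en_US.UTF-8", "C.UTF-8" };
	size_t count = sizeof (names) / sizeof (names[0]);
	wchar_t wcs[256];
	size_t max = sizeof (wcs) / sizeof (wcs[0]);
	size_t i, n;

	/* Pick the first UTF-8 locale that this system provides. */
	for (i = 0; i < count; i++) {
		if (setlocale(LC_CTYPE, names[i]) != NULL)
			break;
	}
	if (i == count)
		return (-1);

	n = mbstowcs(wcs, utf8, max);
	if (n == (size_t)-1 || n >= max)
		return (-1);	/* invalid input or wcs too small */

	return (wcs_to_mbs(wcs, dst, dstsz));
}
```

As with mbstowcs(), this depends on the active locale: in the "C" locale,
wcstombs() can be expected to fail on the Cyrillic characters.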

--
Best regards,
Alexander Pyhalov,
system administrator of Computer Center of Southern Federal University

