Re: Lazy man's UTF8

Michael B. Allen Wed, 18 Sep 2002 11:12:53 -0700

On Wed, 18 Sep 2002 11:07:27 -0400
"Maiorana, Jason" <[EMAIL PROTECTED]> wrote:


> 
> >I have this simple little program, it uses locales (a bit) and even
> >has simple gettext internationalisation, now I want to convert it so
> >that it'll work on a completely UTF-8 locale _or_ a ISO8859-* locale
> >(as it does now) or even an ISO8859-* interface on a UTF-8 system.
> 
> If you don't want to worry about it too much, just use the mb functions
> and let the locale control what they do:
> 
> mblen instead of strlen
> strcoll instead of strcmp
> etc
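
A minimal sketch of the locale-driven route, assuming the program has already called setlocale(LC_ALL, "") at startup. The helper names mb_char_count and mb_compare are made up for illustration. Note that mblen reports the byte length of a single multibyte character, so counting the characters of a whole string is sketched here with mbstowcs instead; passing a NULL destination to get only the count is POSIX behaviour, not something ISO C promises.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of characters (not bytes) in s, decoded
 * according to the current LC_CTYPE locale. Returns (size_t)-1 if s
 * contains a byte sequence that is invalid in that locale. */
size_t mb_char_count(const char *s)
{
    assert(s != NULL);
    /* With a NULL destination, POSIX mbstowcs only counts wide characters. */
    return mbstowcs(NULL, s, 0);
}

/* Hypothetical helper: locale-aware ordering, a thin wrapper over strcoll.
 * The result is <0, 0 or >0, like strcmp, but follows LC_COLLATE. */
int mb_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b);
}
```

In a UTF-8 locale mb_char_count("é") should give 1 where strlen gives 2; in an ISO8859-* locale both agree, which is the point of letting the locale decide.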
> 
> 
> 
> 
> If you want to use hardcoded internal utf-8 and only convert on
> output, then iconv is perfect, and shockingly easy to use.
> I'm actually a fan of completely ignoring locale as far as codesets
> go: I'll use utf-8 internally, and always output utf-8. (Locales
> are fine for date formatting.)
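
For the convert-on-output case, something along these lines might do. It is only a sketch: utf8_to_codeset is a made-up name, the whole string is converted in one call, and the target codeset name (for example "ISO-8859-1", or whatever nl_langinfo(CODESET) reports) is assumed to be one the local iconv implementation knows.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated UTF-8 string `in` to the
 * codeset named by `to`, writing at most outsize-1 bytes plus a NUL to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t utf8_to_codeset(const char *to, const char *in, char *out, size_t outsize)
{
    assert(to != NULL && in != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;            /* this conversion is not available */

    char *inp = (char *)in;           /* iconv wants char **, it does not write here */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;     /* keep room for the terminating NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;            /* bad input, or out buffer too small */
    *outp = '\0';
    return (size_t)(outp - out);
}
```

A real program would probably loop on E2BIG and decide what to do about characters the target codeset cannot represent; how that case is reported differs between iconv implementations.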
> 
> To go that route you do need a good utf-8 to wchar_t converter,
> and occasionally a wchar_t to utf-8 formatter. These things are
> ubiquitous; you can even write your own:
> 
> 
> //here is an example utf-8 formatter
> //it turns ucs-4 character "value" into a utf-8 string held in "buf"
> //which must have room for at least 6 bytes
> //the return value is the length of the utf-8 string
> 
> int ucs4toutf8( wchar_t value, unsigned char *buf )

This assumes wchar_t holds ISO 10646 code points, which is only guaranteed
when the implementation defines __STDC_ISO_10646__. To be portable you have
to use wctomb and mbtowc.
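
Roughly like this, as a sketch. The two wrapper names are invented for illustration, and the output is in whatever multibyte encoding the current LC_CTYPE locale uses, so it is UTF-8 only when the user runs a UTF-8 locale.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical wrapper: encode one wide character in the multibyte encoding
 * of the current locale. buf must have room for MB_LEN_MAX bytes.
 * Returns the byte count, or -1 if wc cannot be represented. */
int wc_to_locale_bytes(wchar_t wc, char *buf)
{
    assert(buf != NULL);
    return wctomb(buf, wc);
}

/* Hypothetical wrapper: decode the first multibyte character of s (looking
 * at no more than n bytes) into *out. Returns the number of bytes consumed,
 * 0 for the NUL character, or -1 for an invalid or incomplete sequence. */
int locale_bytes_to_wc(const char *s, size_t n, wchar_t *out)
{
    assert(s != NULL && out != NULL);
    mbtowc(NULL, NULL, 0);            /* reset any leftover shift state */
    return mbtowc(out, s, n);
}
```

Neither function cares what numeric values wchar_t uses internally, which is why they still work where __STDC_ISO_10646__ is not defined.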

> {
>     if( value <=      0x0000007F )
>     {
>         buf[0] = (unsigned char)value;
>         return 1;
>     }
>     else if( value <= 0x000007FF )
>     {
>         buf[1] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[0] = (unsigned char)(value & 0x1F | 0xC0);
>         return 2;
>     }
>     else if( value <= 0x0000FFFF )
>     {
>         buf[2] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[1] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[0] = (unsigned char)(value & 0x0F | 0xE0);
>         return 3;
>     }
>     else if( value <= 0x001FFFFF )
>     {
>         buf[3] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[2] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[1] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[0] = (unsigned char)(value & 0x07 | 0xF0);
>         return 4;
>     }
>     else if( value <= 0x03FFFFFF )
>     {
>         buf[4] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[3] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[2] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[1] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[0] = (unsigned char)(value & 0x03 | 0xF8);
>         return 5;
>     }
>     else if( value <= 0x7FFFFFFF )
>     {
>         buf[5] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[4] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[3] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[2] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[1] = (unsigned char)(value & 0x3F | 0x80);
>         value>>=6;
>         buf[0] = (unsigned char)(value & 0x01 | 0xFC);
>         return 6;
>     }
>     return 0;
> }
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
> 
> 


-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and more importantly to tasks that have not
yet been conceived. 
