Hello Rich,

Thanks a lot. That was a really nice clarification of the different aspects of the issue, and sorry again if my questions were too elementary. For me it was a very useful discussion and I learned a lot. Thank you so much.

Best Regards,
Ali

On 4/17/07, Rich Felker <[EMAIL PROTECTED]> wrote:
On Tue, Apr 17, 2007 at 03:17:48PM +0000, Ali Majdzadeh wrote:
> Hi Rich
> Thanks for your attention. I do use UTF-8, but the files I am dealing with
> are encoded using a strange encoding system; I used iconv to convert them
> into UTF-8. By the way, another question: if all those stdio.h and
> string.h functions work well with UTF-8 strings, as they actually do,
> what would be the reason to use wchar_t and wchar_t-aware functions?

There are a mix of reasons, but most stem from the fact that the Japanese designed some really bad encodings for their language prior to UTF-8, which are almost impossible to use in a standard C environment. At the time, the ANSI/ISO C committee thought that it would be necessary to avoid using char strings directly for multilingual text purposes, and was setting up to transition to wchar_t strings; however, this transition was very incomplete. Note that C has no support for using wchar_t strings as filenames, and likewise POSIX has no support for using them for anything having to do with interfacing with the system or library in places where strings are needed. Thus there was going to be a dichotomy where multilingual text would be a special case only available in some places, while system stuff, filenames, etc. would have to be ASCII. UTF-8 does away with that dichotomy.

The main remaining use of wchar_t is that, if you wish to write portable C applications which work on many different text encodings (both UTF-8 and legacy) depending on the system's or user's locale, you can use mbrtowc/wcrtomb and related functions when it's necessary to determine the identity of a particular character in a char string, then use the isw* functions to ask questions like: Is it alphabetic? Is it printable? etc.
On modern C systems (indicated by the presence of the __STDC_ISO_10646__ preprocessor symbol), wchar_t will be Unicode UCS-4, so if you're willing to sacrifice some degree of portability, you can use the values from the mb/wc conversion functions directly as Unicode character numbers for lookup in fonts, character data tables, etc.

Another option is to use the iconv() API (this is part of the Single Unix Specification, not general C) to convert between the locale's encoding and UTF-8 or UCS-4 if you need to make sure your data is in a particular form. However, for natural language work where there's a reasonable expectation that any user of the software would already be using UTF-8 as their encoding, IMO it makes sense to just assume you're working in a UTF-8 environment. Some may disagree on this.

Hope this helps.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
