Hello Rich,

Thanks a lot. That was a really nice clarification of the different aspects of the issue, and sorry again if my questions were too elementary. For me it was a very useful discussion and I learned a lot. Thank you so much.

Best Regards,
Ali

On 4/17/07, Rich Felker <[EMAIL PROTECTED]> wrote:
On Tue, Apr 17, 2007 at 03:17:48PM +0000, Ali Majdzadeh wrote:
> Hi Rich
> Thanks for your attention. I do use UTF-8, but the files I am dealing with
> are encoded using a strange encoding system; I used iconv to convert them
> into UTF-8. By the way, another question: if all those stdio.h and
> string.h functions work well with UTF-8 strings, as they actually do,
> what would be the reason to use wchar_t and wchar_t-aware functions?

There are a mix of reasons, but most stem from the fact that the Japanese designed some really bad encodings for their language prior to UTF-8, which are almost impossible to use in a standard C environment. At the time, the ANSI/ISO C committee thought that it would be necessary to avoid using char strings directly for multilingual text purposes, and was setting up to transition to wchar_t strings; however, this transition was very incomplete. Note that C has no support for using wchar_t strings as filenames, and likewise POSIX has no support for using them for anything having to do with interfacing with the system or library in places where strings are needed. Thus there was going to be a dichotomy where multilingual text would be a special case only available in some places, while system stuff, filenames, etc. would have to be ASCII. UTF-8 does away with that dichotomy.

The main remaining use of wchar_t is that, if you wish to write portable C applications which work on many different text encodings (both UTF-8 and legacy) depending on the system's or user's locale, you can use mbrtowc/wcrtomb and related functions when it's necessary to determine the identity of a particular character in a char string, then use the isw* functions to ask questions like: Is it alphabetic? Is it printable? etc.
On modern C systems (indicated by the presence of the __STDC_ISO_10646__ preprocessor symbol), wchar_t will be Unicode UCS-4, so if you're willing to sacrifice some degree of portability, you can use the values from the mb/wc conversion functions directly as Unicode character numbers for lookup in fonts, character data tables, etc.

Another option is to use the iconv() API (this is part of the Single Unix Specification, not general C) to convert between the locale's encoding and UTF-8 or UCS-4 if you need to make sure your data is in a particular form. However, for natural language work where there's a reasonable expectation that any user of the software would already be using UTF-8 as their encoding, IMO it makes sense to just assume you're working in a UTF-8 environment. Some may disagree on this.

Hope this helps.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
