Hi,

At 09 Nov 2000 11:30:26 -0500, Itai Zukerman <[EMAIL PROTECTED]> wrote:
> I'm working on a piece of software that will parse textual data (a
> list of words), conduct some statistical analyses, and spit out more
> textual data. I'd like to support multiple languages, maybe even
> multibyte encodings. Can someone please point me towards some
> resources, in particular how to handle text input and output in a
> language-independent way? As you can probably guess, I'm new to i18n.

I recommend that you use the wchar_t-related functions. This is the
standard C way to handle multiple encodings. Using wchar_t, the
software becomes locale-sensitive: if a user sets his/her environment
to UTF-8 (for example, export LC_CTYPE=en_US.UTF-8), the software can
read and write UTF-8. This is important: all a user has to do is set
the LANG variable for all software to work in the desired locale.

Your software will need to call setlocale(LC_ALL, ""); first.
LC_ALL may be LC_CTYPE instead.

Input is done using fgetc() + mbstowcs(), or fgetwc(). Output is done
using wcstombs() + fputc(), or fputwc(). The principle is: I/O is done
using 'multibyte characters' and internal processing is done using
'wide characters'. Note that a multibyte character is not always
multiple bytes; it is a term from the C standard.

Note that your software will also support UTF-8 with this wchar_t
style of programming, provided the OS supports a UTF-8 locale.
Implementing UTF-8 directly means that:

- the software supports only UTF-8 (and Latin-1?), and
- the user cannot use the common way to switch locales.

Since these are standard C functions, you will be able to study
wchar_t with various books. Please read the following:

http://www.debian.org/doc/manuals/intro-i18n/index.html

This is a Debian Documentation Project document on i18n, whose writer
is me. (I have to update it for glibc 2.1.9x...)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/

