Hi,

At 09 Nov 2000 11:30:26 -0500, Itai Zukerman <[EMAIL PROTECTED]> wrote:
> I'm working on a piece of software that will parse textual data (a
> list of words), conduct some statistical analyses, and spit out more
> textual data. I'd like to support multiple languages, maybe even
> multibyte encodings. Can someone please point me towards some
> resources, in particular how to handle text input and output in a
> language-independent way? As you can probably guess, I'm new to i18n.

I recommend that you use the wchar_t-related functions. This is the
standard C way to handle multiple encodings. Using wchar_t, the
software becomes locale-sensitive: if a user sets his/her environment
to UTF-8 (for example, export LC_CTYPE=en_US.UTF-8), the software can
read and write UTF-8. This is important: all a user has to do is set
the LANG variable for all software to work in the desired locale.

Your software will need to call setlocale(LC_ALL, ""); first.
LC_ALL may be LC_CTYPE instead.

Input is done using fgetc() + mbstowcs(), or fgetwc(). Output is done
using wcstombs() + fputc(), or fputwc(). The principle is: I/O is done
using 'multibyte characters' and internal processing is done using
'wide characters'. Note that a multibyte character is not always
multiple bytes; it is a term from the C standard.

Note that your software will also support UTF-8 with this wchar_t
style of programming, provided the OS supports a UTF-8 locale.
Implementing UTF-8 directly means that:

- the software supports only UTF-8 (and Latin-1?), and
- the user cannot use the common way to switch locales.

Since these are standard C functions, you will be able to study
wchar_t with various books. Please read the following:

http://www.debian.org/doc/manuals/intro-i18n/index.html

This is a Debian Documentation Project document on i18n, whose writer
is me. (I have to update it for glibc 2.1.9x...)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/

