Re: Questions about Unicode-aware C programs under Linux

Ali Majdzadeh Mon, 16 Apr 2007 23:17:35 -0700

Hello Rich
Thanks for your response.
About your question, I should say "yes", I need some text processing
capabilities.
Do you mean that I should use common stdio functions? (like, fgets(), ...)
And what about UTF-8 strings? Do you mean that these strings should be
stored in common char* variables? If so, what about the character size
difference (Unicode and ASCII)? And what about the string functions
(like strtok())?
Sorry, I am new to the issue.


Best Regards
Ali


On 4/16/07, Rich Felker <[EMAIL PROTECTED]> wrote:

On Mon, Apr 16, 2007 at 11:33:26AM +0330, Ali Majdzadeh wrote:
> Hello All
> Sorry, if my questions are elementary. As I know, the size of wchar_t data
> type (glibc), is compiler and platform dependent. What is the best practice
> of writing portable Unicode-aware C programs? Is it a good practice to use
> Unicode literals directly in a C program?

It depends on the degree of portability you want. Using them in wide
strings is not entirely portable (depends on the translation character
encoding), but using them in UTF-8 strings is (they're just byte
sequences).
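
To make the "just byte sequences" point concrete, here is a minimal sketch. The names salam and utf8_count are made up for illustration, and the counter assumes its input is already valid UTF-8. The Persian word is written with hex escapes so that the source file stays pure ASCII and the compiler's character set settings do not matter.

```c
#include <stddef.h>

/* The Persian word "salam" (U+0633 U+0644 U+0627 U+0645) spelled out as
   UTF-8 bytes. Hypothetical example data, two bytes per letter. */
const char salam[] = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85";

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes s is valid, NUL-terminated UTF-8; no validation is done. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

With these definitions strlen(salam) is 8 (bytes), while utf8_count(salam) is 4 (characters), which is the "character size difference" in practice: the string still lives in an ordinary char array.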

> I have experienced some problems
> with glibc's wide character string functions, I want to know is there any
> standard way of programming or standard template to write a Unicode-aware C
> program? By the way, my native language is Persian. I am working on a C
> program which reads a Persian text file, parses it and generates an XML
> document.

If your application is Persian-specific, then you're completely
entitled to assume the text encoding is UTF-8 and that the system is
capable of dealing with UTF-8 and Unicode. Will there be any
Persian-specific text processing though, or do you just want to be able
to pass through Persian text?

> For this, there exist lots of issues that need the use of library
> functions (eg. wcscpy(), wcsstr(), wcscmp(), fgetws(), fwprintf(), ...),
> and, as I mentioned earlier, I have experienced some odd problems using
> them. (eg. wcsstr() never succeeds in matching two wchar_t * Persian
> strings.)

wcsstr doesn't care about encoding or Unicode semantics or anything.
It just looks for binary substring matches, just like strstr but using
wchar_t instead of char as the unit.
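
A small sketch of what a binary match means in practice. This is only one possible reason for a failed match, not a diagnosis of the problem described above: Persian text sometimes mixes U+06CC (Farsi yeh) with the similar-looking U+064A (Arabic yeh), and wcsstr treats them as unrelated values. The names iran_fa, iran_ar and wide_contains are hypothetical, and the strings are given as numeric arrays so the result does not depend on how the compiler translates wide literals.

```c
#include <wchar.h>

/* The same word spelled with Farsi yeh (U+06CC) ... */
const wchar_t iran_fa[] = { 0x0627, 0x06CC, 0x0631, 0x0627, 0x0646, 0 };
/* ... and with Arabic yeh (U+064A). They can look alike on screen. */
const wchar_t iran_ar[] = { 0x0627, 0x064A, 0x0631, 0x0627, 0x0646, 0 };

/* Return 1 if needle occurs in hay as an exact sequence of wchar_t
   values. No normalization or Unicode equivalence is applied. */
int wide_contains(const wchar_t *hay, const wchar_t *needle)
{
    return wcsstr(hay, needle) != NULL;
}
```

Here wide_contains(iran_fa, iran_fa) is 1 and wide_contains(iran_fa, iran_ar) is 0. If the file and the search string disagree like this, the text has to be normalized by the program before searching; strstr on UTF-8 behaves the same way.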

Overall I'd suggest ignoring the wchar_t functions. Especially the
wide stdio functions are problematic. Using UTF-8 is just as easy and
then your strings are directly usable for input and output to/from
text files, commandline, etc.
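
A rough sketch of what this looks like with the ordinary stdio and string functions. The function names are invented for the example, the 4096-byte line buffer is an arbitrary assumption, and the input is assumed to be valid UTF-8. strstr works because a valid UTF-8 sequence can only match at a character boundary, and strtok with ASCII delimiters works because ASCII bytes never appear inside a multibyte character.

```c
#include <stdio.h>
#include <string.h>

/* Count the lines of `in` that contain the UTF-8 byte sequence `needle`.
   Lines longer than the buffer are read in pieces, so a needle that
   straddles a piece boundary would be missed (sketch-level limitation). */
size_t count_matching_lines(FILE *in, const char *needle)
{
    char line[4096];
    size_t hits = 0;
    while (fgets(line, sizeof line, in))
        if (strstr(line, needle))
            hits++;
    return hits;
}

/* Count whitespace-separated words in `line`. strtok modifies the
   buffer, so pass a writable copy. Only ASCII whitespace is handled. */
size_t count_words(char *line)
{
    static const char delims[] = " \t\r\n";
    size_t n = 0;
    for (char *tok = strtok(line, delims); tok; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

Nothing here needs wchar_t or a locale setup. Splitting on non-ASCII separators (for example the Arabic comma U+060C) would need a search for the multibyte sequence with strstr instead of strtok.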

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
