Re: c++ strings and UTF-8 (other charsets)
William J Poser wrote:
> Although a zero byte may not be part of a C string, it may be part
> of a character string literal. See section 6.4.5, p. 62, of the C99
> standard. Character string literals need not be strings.
>
> Bill

Ok, so no danger here.

Thanks
Marcel

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
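The distinction between a string literal and a C string can be seen in a few lines. The following is only a minimal sketch; the names `kLiteral`, `literal_array_size`, `literal_c_length`, `literal_as_std_string` and `check_literal` are made up for this illustration and come from no library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// A character string literal with an embedded zero byte. The array holds
// 'a', 'b', '\0', 'c', 'd' plus the terminating '\0' added by the compiler.
static const char kLiteral[] = "ab\0cd";

// Size of the array in bytes, terminator included.
std::size_t literal_array_size() { return sizeof kLiteral; }

// Length as C string functions see it: they stop at the first zero byte.
std::size_t literal_c_length() { return std::strlen(kLiteral); }

// std::string keeps the embedded zero byte when given an explicit length.
std::string literal_as_std_string() {
    return std::string(kLiteral, sizeof kLiteral - 1);
}

// Small self-check of the three views of the same literal.
void check_literal() {
    assert(literal_array_size() == 6);
    assert(literal_c_length() == 2);
    assert(literal_as_std_string().size() == 5);
}
```

So the literal occupies six bytes, while strlen() reports two: only the part before the first zero byte is a string in the C sense.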
Re: c++ strings and UTF-8 (other charsets)
Rich Felker wrote:
> On Tue, Feb 27, 2007 at 07:49:17PM -0500, Daniel B. wrote:
>> Marcel Ruff wrote:
>>> As UTF-8 may not contain '\0' ...
>>
>> Yes it can.
>
> No, I think he just meant to say a string of non-NUL _characters_ may
> not contain a 0 _byte_. The NUL character is not valid text or a valid
> part of a string in the POSIX sense of text or the C/POSIX sense of
> string.

Yes, you describe my issue more precisely.

Thanks
Marcel
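The claim that only the NUL character produces a zero byte follows from the UTF-8 bit patterns: every byte of a multi-byte sequence has its high bit set. A minimal sketch of an encoder makes this checkable; `encode_utf8` and `encoding_has_zero_byte` are hypothetical helper names, not part of any standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode scalar value as standard UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;         // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True if the encoded form of cp contains a zero byte.
bool encoding_has_zero_byte(std::uint32_t cp) {
    return encode_utf8(cp).find('\0') != std::string::npos;
}

// Self-check: U+0000 is the only code point whose encoding has a zero byte.
void check_zero_bytes() {
    assert(encoding_has_zero_byte(0));
    for (std::uint32_t cp = 1; cp <= 0x10FFFF; ++cp) {
        assert(!encoding_has_zero_byte(cp));
    }
}
```

Since lead bytes of multi-byte sequences are at least 0xC0 and continuation bytes at least 0x80, a zero byte can only ever be the one-byte encoding of U+0000.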
Re: c++ strings and UTF-8 (other charsets)
Daniel B. wrote:
> Are you thinking of Java's _modified_ version of UTF-8
> (http://en.wikipedia.org/wiki/UTF-8#Java)?

Ugh, disgusting...
Yes, this is a serious open issue for my approach!
Does anybody have some practical advice on this?

Marcel
Re: c++ strings and UTF-8 (other charsets)
Rich Felker wrote:
> On Thu, Mar 01, 2007 at 09:41:44AM +0100, Marcel Ruff wrote:
>>> Are you thinking of Java's _modified_ version of UTF-8
>>> (http://en.wikipedia.org/wiki/UTF-8#Java)?
>>
>> Ugh, disgusting...
>> Yes, this is a serious open issue for my approach!
>> Does anybody have some practical advice on this?
>
> Just treat the sequence c0 80 according to the spec, as an invalid
> sequence. Neither it (because it's illegal utf-8) nor a real NUL
> (because it's illegal in text) should appear. If your problem is more
> specific and there's a real reason you need to handle such data
> differently, please describe what you're doing so we can offer better
> advice.
>
> Rich

The first sentence from the above wiki says:

  In normal usage, the Java programming language
  (http://en.wikipedia.org/wiki/Java_%28programming_language%29)
  supports standard UTF-8 when reading and writing strings through
  InputStreamReader
  (http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html)
  and OutputStreamWriter
  (http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html)

and this is what I do to access sockets, so no problems here.

But then it states that the 'Supplementary multilingual plane' is
encoded incompatibly.

So must I assume that if I send 'mathematical alphanumeric symbols'
(http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols)
like 'ℝ' from C to Java, they will be corrupted?

Both applications work with what they think is 'UTF-8' ...

Marcel
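Treating c0 80 "according to the spec" amounts to a strict well-formedness check on incoming bytes. The following is a rough sketch of such a check, not a reference implementation; `is_valid_utf8` is a hypothetical name. Note that it accepts a real zero byte, since U+0000 is well-formed UTF-8; an application that considers NUL illegal in text would have to reject it separately.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed standard UTF-8.
// Rejects overlong forms (such as C0 80), encoded surrogates,
// truncated sequences and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // encoded surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += len;
    }
    return true;
}

// Self-check with the byte sequences discussed in this thread.
void check_validation() {
    assert(!is_valid_utf8(std::string("\xC0\x80", 2)));   // Java's NUL
    assert(is_valid_utf8("\xE2\x84\x9D"));                // U+211D
    assert(!is_valid_utf8("\xED\xA0\xB5\xED\xB4\xB8"));   // surrogate pair
}
```

The same check also catches the surrogate-pair form, because each half decodes to a value in the range U+D800..U+DFFF.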
Re: c++ strings and UTF-8 (other charsets)
Rich Felker wrote:
> On Thu, Mar 01, 2007 at 07:53:52PM +0100, Marcel Ruff wrote:
>>> Are you thinking of Java's _modified_ version of UTF-8
>>> (http://en.wikipedia.org/wiki/UTF-8#Java)?
>>
>> The first sentence from the above wiki says:
>>
>>   In normal usage, the Java programming language
>>   (http://en.wikipedia.org/wiki/Java_%28programming_language%29)
>>   supports standard UTF-8 when reading and writing strings through
>>   InputStreamReader
>>   (http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html)
>>   and OutputStreamWriter
>>   (http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html)
>>
>> and this is what I do to access sockets, so no problems here.
>>
>> But then it states that the 'Supplementary multilingual plane' is
>> encoded incompatibly.
>
> Oh, you're talking about that part, not the NUL issue. Then yes, it's
> a major problem. Java generates and processes bogus illegal UTF-8
> (surrogates). I don't know if there are any easy workarounds except to
> flame Sun to hell for being so stupid..
>
>> So must I assume that if I send 'mathematical alphanumeric symbols'
>> (http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols)
>> like 'ℝ' from C to Java, they will be corrupted?
>
> ℝ is in the BMP, so no problem with it. It's just the huge pages of
> random letters in every single font/style imaginable that are outside
> the BMP. Of course various important CJK characters (needed for
> writing certain names) and historical scripts are also outside the
> BMP.
>
>> Both applications work with what they think is 'UTF-8' ...
>
> Yes. And Java is wrong. However, according to the Wikipedia article
> referenced, Java _does_ do the right thing in input and output
> streams. It's only the object serialization stuff that uses the bogus
> UTF-8. So I don't think you're likely to have problems in practice as
> long as you don't try to pass this data off (which would be in binary
> files anyway, I think...?) as UTF-8.

Ok, thanks, so porting legacy C/C++ to Unicode UTF-8 is simple :-)

Marcel
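To make the incompatibility concrete, here is a small sketch that encodes one character outside the BMP both ways. U+1D538 (a double-struck capital A from the mathematical alphanumeric block) is chosen only as an example; the function names `append_3byte`, `supplementary_standard` and `supplementary_as_surrogates` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the 3-byte UTF-8 bit pattern for a value in 0x0800..0xFFFF.
static void append_3byte(std::string& out, std::uint32_t v) {
    out += static_cast<char>(0xE0 | (v >> 12));
    out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (v & 0x3F));
}

// Standard UTF-8 for a supplementary code point (U+10000..U+10FFFF):
// one 4-byte sequence.
std::string supplementary_standard(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    std::string out;
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}

// Surrogate-pair form as used by Java's modified UTF-8: the code point is
// split into its two UTF-16 surrogates and each one is written with the
// 3-byte pattern, giving 6 bytes that a strict UTF-8 decoder rejects.
std::string supplementary_as_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;
    std::string out;
    append_3byte(out, 0xD800 + (v >> 10));    // high surrogate
    append_3byte(out, 0xDC00 + (v & 0x3FF));  // low surrogate
    return out;
}

// Self-check with U+1D538.
void check_supplementary() {
    assert(supplementary_standard(0x1D538) == "\xF0\x9D\x94\xB8");
    assert(supplementary_as_surrogates(0x1D538) == "\xED\xA0\xB5\xED\xB4\xB8");
}
```

ℝ (U+211D), by contrast, lies in the BMP and is the same three bytes, E2 84 9D, in both schemes.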
Re: c++ strings and UTF-8 (other charsets)
Daniel B. wrote:
> Marcel Ruff wrote:
>> ...
>> As UTF-8 may not contain '\0' ...
>
> Yes it can.
>
> Are you thinking of Java's _modified_ version of UTF-8
> (http://en.wikipedia.org/wiki/UTF-8#Java)?
>
> Daniel

Oi oi oi, this complicates things again.

1. Could serializing UTF-8 in Java over a socket and reading it in
   C/C++ as UTF-8 cause problems?
   - Is there a utility to convert between Java's UTF-8 and standard
     UTF-8?
2. Using UTF-8 in C: when and how can it happen that a char* contains
   a '\0' which is a character instead of the end of the string?

Thanks for some enlightenment,
Marcel
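On the first question, a converter from Java's modified form to standard UTF-8 is small enough to write by hand. The sketch below is an illustration under stated assumptions, not an existing utility: `modified_to_standard_utf8` is a hypothetical name, and the input is assumed to be otherwise well formed. It only rewrites the two known differences (c0 80 for U+0000, and supplementary characters written as two 3-byte surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Convert Java "modified UTF-8" to standard UTF-8 (hypothetical helper).
// All bytes other than the two special forms are copied unchanged.
std::string modified_to_standard_utf8(const std::string& in) {
    std::string out;
    const std::size_t n = in.size();
    auto byte = [&in](std::size_t k) {
        return static_cast<unsigned char>(in[k]);
    };
    std::size_t i = 0;
    while (i < n) {
        if (n - i >= 2 && byte(i) == 0xC0 && byte(i + 1) == 0x80) {
            out += '\0';  // Java's two-byte NUL becomes a real zero byte
            i += 2;
        } else if (n - i >= 6 &&
                   byte(i) == 0xED && (byte(i + 1) & 0xF0) == 0xA0 &&
                   (byte(i + 2) & 0xC0) == 0x80 &&
                   byte(i + 3) == 0xED && (byte(i + 4) & 0xF0) == 0xB0 &&
                   (byte(i + 5) & 0xC0) == 0x80) {
            // Rebuild the two UTF-16 surrogates, then the code point.
            const std::uint32_t hi =
                0xD000u | ((byte(i + 1) & 0x3Fu) << 6) | (byte(i + 2) & 0x3Fu);
            const std::uint32_t lo =
                0xD000u | ((byte(i + 4) & 0x3Fu) << 6) | (byte(i + 5) & 0x3Fu);
            const std::uint32_t cp =
                0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
            // Emit the standard 4-byte sequence.
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
            i += 6;
        } else {
            out += in[i];  // everything else is identical in both forms
            ++i;
        }
    }
    return out;
}
```

On the second question: with this conversion, the result is a std::string that may hold a real zero byte, so it has to be carried with an explicit length rather than as a plain char*. An application that does not expect NUL in text may prefer to reject c0 80 instead of converting it.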
Re: c++ strings and UTF-8 (other charsets)
Rich Felker wrote:
> On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
>> On Mon, Feb 26, 2007 at 08:10:59AM +0100,
>> Marcel Ruff wrote
>> a message of 65 lines which said:
>>
>>> As UTF-8 may not contain '\0' you can simply use all functions as
>>> before (strcmp(), std::string etc.).
>>
>> As long as you just store or retrieve strings. If you compare them
>> (strcmp), you HAVE TO take normalization into account.
>
> No you don't. Nothing in Unicode says that you must treat canonically
> equivalent strings as identical, and in fact doing so is a bad idea in
> most of the situations I've worked with. Unicode only says that you
> should not assume that another process (in the Unicode sense of the
> word process) will treat them as being distinct. If your particular
> application has a special need for normalization, then yes you need to
> take it into account. But if you're doing something like passing
> around filenames you most surely should not be normalizing anything.
>
>> If you measure them (strlen), you HAVE TO use a character semantic,
>> not a byte semantic. And so on.
>
> Huh? Length in characters is basically useless to know. Length in
> bytes and width of the text when rendered to a visual presentation are
> both useful, but the only place where knowing length in number of
> characters is useful is for fields that are limited to a fixed number
> of characters. If the limit is for the sake of using a fixed-size
> storage object, then this limit should just be changed to a limit in
> bytes instead of in characters..
>
>>> Old code doesn't need to be ported.
>>
>> Very strange advice, indeed.
>
> ?? Hardly strange.. It depends on what the code does. See Markus
> Kuhn's UTF-8 FAQ. But Marcel is right about a lot of old code (just
> not all). Most code doesn't care at all about the contents of the
> text, just that it's a string.
>
> Rich

Thanks for all those details.

I can only say that when I started to port a C and a C++ library to
support Unicode on Linux/Unix/Windows/WindowsCE, I was totally lost
with the heaps of complicated and confusing advice found on the
internet (the reason why I joined this mailing list).

But in the end everything was very simple:

1. UTF-8 text does not contain zero bytes (only the NUL character
   itself is encoded as one)
2. Doing everything in UTF-8 and keeping my std::string and char* was
   a very simple solution
3. I would need to define my own data types if I wanted to support
   UTF-16 (similar to Xerces and all the others). This would be a
   major effort.
4. Take care when passing the strings to other libraries / GUIs, as
   mentioned in my first post

Getting to the above *simple* insight took me several confused days;
after that, the porting effort was done in one day.

I just wanted to share this to save others all the confusion,
Marcel
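The two disputed points, length and comparison, can be illustrated with a single accented letter. This is only a sketch; `count_code_points`, `kPrecomposed`, `kDecomposed` and `check_lengths` are names invented for the example, and the counting function assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Count code points in valid UTF-8 by skipping continuation bytes,
// which all have the bit pattern 10xxxxxx.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "é" as one precomposed code point (U+00E9) ...
static const char kPrecomposed[] = "\xC3\xA9";
// ... and as "e" followed by COMBINING ACUTE ACCENT (U+0301).
static const char kDecomposed[] = "e\xCC\x81";

// Self-check: the two spellings render alike but differ in every count.
void check_lengths() {
    assert(std::strlen(kPrecomposed) == 2);          // bytes
    assert(count_code_points(kPrecomposed) == 1);    // code points
    assert(std::strlen(kDecomposed) == 3);
    assert(count_code_points(kDecomposed) == 2);
    // Byte-wise comparison treats canonically equivalent strings as distinct.
    assert(std::strcmp(kPrecomposed, kDecomposed) != 0);
}
```

Whether that last result is a problem depends on the application, which is the point made in the quoted reply: for filenames and similar opaque data, byte-wise comparison is what you want.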
Re: c++ strings and UTF-8 (other charsets)
Rich Felker wrote:
> On Sat, Feb 24, 2007 at 06:13:37PM +0100, Julien Claassen wrote:
>> Hi!
>> What I meant about UTF-8-strings in c++: I mean in c and c++
>> they're not standard like in Java.
>
> UTF-16, used by Java, is also variable-width. It can be either 2 bytes
> or 4 bytes per character. Support for the characters that use 4 bytes
> is generally very poor due to the misconception that it's
> fixed-width.. :(
>
>> I think UTF-8 is a variable width multibyte charset, so there are
>> specific problems in handling them allocating the right space. I
>> mean the Glib contains something like UString and QT has its
>> QStrings, which I think are also UTF-8 capable.

As far as I know, using UTF-8 in C or C++ is very simple:

As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.). Old code doesn't need to be
ported.

The only place to take care is when interfacing with other libraries
that use wchar_t and such (UTF-16, UTF-32); here you need to convert
using functions like wcstombs(), mbstowcs(), mbrtowc() and such.

This works well on Linux, Windows or other OS,

Marcel

> All strings are UTF-8 capable; the unit of data is simply bytes
> instead of characters. If you're looking for a class that treats
> strings as a sequence of abstract characters rather than a sequence of
> bytes, you could look for a library to do this or write your own.
> However I suspect the most useful way to do this on C++ would be to
> extend whatever standard byte-based string class you're using with a
> derived class. Maybe there's something like this built in to the C++
> STL classes already that I'm not aware of. As I said I don't know much
> of (modern) C++. Can someone who knows the language better provide an
> answer?
>
> It would also be easier to provide you answers if we knew better what
> you're trying to do with the strings, i.e. whether you just need to
> store them and spit them back in output, or whether you need to do
> higher-level unicode processing like line breaks, collation,
> rendering, etc.
>
> Rich
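For code that does need a sequence of characters rather than bytes, one option is to decode at the boundary and keep std::string everywhere else. The standard C conversion functions such as mbstowcs() depend on the current locale, and wchar_t is 16 bits wide on Windows, so the sketch below swaps in a hand-written, locale-independent decoder to std::u32string instead. `utf8_to_utf32` is a hypothetical name, C++11 is assumed, and the input is assumed to be already well-formed UTF-8 (no validation is done here).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 into a sequence of UTF-32 code points.
// Assumes well-formed input; malformed bytes are not detected.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;  // sequence length implied by the lead byte
        char32_t cp = b0;     // payload bits collected so far
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}

// Self-check: "a" followed by U+211D, then one character outside the BMP.
void check_decoding() {
    const std::u32string r = utf8_to_utf32("a\xE2\x84\x9D");
    assert(r.size() == 2);
    assert(r[0] == U'a');
    assert(r[1] == 0x211D);
    assert(utf8_to_utf32("\xF0\x9D\x94\xB8")[0] == 0x1D538);
}
```

Indexing the result gives one code point per element, which is as close as standard C++ gets to a "string of characters" without a dedicated Unicode library; line breaking, collation and rendering still need more than this.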