Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Marcel Ruff

William J Poser wrote:

Although a zero byte may not be part of a C string, it may
be part of a character string literal. See section 6.4.5,
p. 62, of the C99 standard. Character string literals
need not be strings.
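
A minimal sketch of the distinction, assuming a C++ compiler (the same holds in C). The names `literal`, `literal_array_size`, `literal_c_length` and `literal_as_std_string` are made up for this illustration and do not come from the thread.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// A string literal with an embedded zero byte. The array holds
// 'a', '\0', 'b' and the terminating '\0' the compiler appends.
static const char literal[] = "a\0b";

// Size in bytes of the array initialized by the literal.
constexpr std::size_t literal_array_size = sizeof(literal);

// Length when treated as a C string: strlen stops at the first zero byte.
inline std::size_t literal_c_length() { return std::strlen(literal); }

// std::string keeps embedded zero bytes when given an explicit length.
inline std::string literal_as_std_string() {
    return std::string(literal, sizeof(literal) - 1);
}
```

So the literal occupies four bytes, while the C string it begins with has length one. Only code that passes an explicit length sees the bytes after the first zero.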
  

Ok, so no danger here.

Thanks
Marcel

Bill

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


  






Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Marcel Ruff

Rich Felker wrote:

On Tue, Feb 27, 2007 at 07:49:17PM -0500, Daniel B. wrote:
  

Marcel Ruff wrote:




As UTF-8 may not contain '\0' ...
  

Yes it can.



No, I think he just meant to say a string of non-NUL _characters_ may
not contain a 0 _byte_. The NUL character is not valid text or a
valid part of a string in the POSIX sense of text or the C/POSIX
sense of string.
  

Yes, you describe my issue more precisely.

thanks
Marcel





Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Marcel Ruff



Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?



Ugh, disgusting...
  

Yes, this is an open, serious issue for my approach!

Does anybody have some practical advice on this?

Marcel




Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Marcel Ruff

Rich Felker wrote:

On Thu, Mar 01, 2007 at 09:41:44AM +0100, Marcel Ruff wrote:
  

Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
   


Ugh, disgusting...
 
  

Yes, this is an open, serious issue for my approach!

Does anybody have some practical advice on this?



Just treat the sequence c0 80 according to the spec, as an invalid
sequence. Neither it (because it's illegal utf-8) nor a real NUL
(because it's illegal in text) should appear. If your problem is more
specific and there's a real reason you need to handle such data
differently, please describe what you're doing so we can offer better
advice.
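
The advice above can be sketched as a validity check. This is only an illustration under stated assumptions: `is_valid_utf8_text` is a hypothetical helper, not a library function, and it rejects a real NUL byte because the advice treats NUL as illegal in text.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): returns true only if `s` is
// well-formed UTF-8 and contains no NUL byte.
bool is_valid_utf8_text(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;       // total bytes in this sequence
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest code point allowed for this length

        if (lead == 0x00) return false;      // real NUL: not valid in text
        if (lead < 0x80) { ++i; continue; }  // plain ASCII
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;                   // stray continuation or bad lead byte

        if (n - i < len) return false;       // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong, e.g. C0 80
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += len;
    }
    return true;
}
```

The overlong check is what rejects c0 80: the two bytes decode to code point 0, which is below the minimum for a two-byte sequence.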
  

The first sentence from the above wiki says:

In normal usage, the Java programming language
(http://en.wikipedia.org/wiki/Java_%28programming_language%29) supports
standard UTF-8 when reading and writing strings through InputStreamReader
(http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html)
and OutputStreamWriter
(http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html)


And this is what I do to access sockets, so no problems here.

But then it states that the 'Supplementary Multilingual Plane' is encoded
incompatibly.

So must I assume that if I send 'mathematical alphanumeric symbols'
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
like 'ℝ' from C to Java, they will be corrupted?
Both applications work with what they think is 'UTF-8' ...

Marcel

Rich



  






Re: c++ strings and UTF-8 (other charsets)

2007-03-01 Thread Marcel Ruff

Rich Felker wrote:

On Thu, Mar 01, 2007 at 07:53:52PM +0100, Marcel Ruff wrote:
  

Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?


The first sentence from the above wiki says:

In normal usage, the Java programming language
(http://en.wikipedia.org/wiki/Java_%28programming_language%29) supports
standard UTF-8 when reading and writing strings through InputStreamReader
(http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html)
and OutputStreamWriter
(http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html)


And this is what I do to access sockets, so no problems here.

But then it states that the 'Supplementary Multilingual Plane' is encoded
incompatibly.



Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java generates and processes bogus illegal UTF-8
(surrogates). I don't know if there are any easy workarounds except to
flame Sun to hell for being so stupid..

  

So must I assume that if I send 'mathematical alphanumeric symbols'
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
like 'ℝ' from C to Java, they will be corrupted?



ℝ is in the BMP, so no problem with it. It's just the huge pages of
random letters in every single font/style imaginable that are outside
the BMP. Of course various important CJK characters (needed for
writing certain names) and historical scripts are also outside the
BMP.
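
To make the BMP point concrete, here is a small encoder sketch. `encode_utf8` is a hypothetical name chosen for this example. Code points up to U+FFFF (the BMP) need at most three bytes; anything above needs four, and those are the characters Java handles as surrogate pairs.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as standard UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For example, ℝ is U+211D and encodes as the three bytes E2 84 9D, while U+1D538 (a double-struck A from the mathematical alphanumeric block) encodes as the four bytes F0 9D 94 B8.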

  

Both applications work with what they think is 'UTF-8' ...



Yes. And Java is wrong. However, according to the Wikipedia article
referenced, Java _does_ do the right thing in input and output
streams. It's only the object serialization stuff that uses the bogus
UTF-8. So I don't think you're likely to have problems in practice as
long as you don't try to pass this data off (which would be in binary
files anyway, I think...?) as UTF-8.
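
If such serialized data does have to be read from C++, a conversion could look roughly like the sketch below. It assumes the two differences described in the referenced Wikipedia article: U+0000 written as C0 80, and characters outside the BMP written as two three-byte encoded surrogates. `modified_utf8_to_utf8` and `decode3` are hypothetical names, and anything other than those two patterns is copied through unchecked.

```cpp
#include <cstddef>
#include <string>

// Decodes a three-byte sequence starting at p into a 16-bit value.
static unsigned decode3(const unsigned char* p) {
    return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
}

// Hypothetical converter: rewrites C0 80 as a zero byte and encoded
// surrogate pairs as four-byte UTF-8. All other bytes are copied as-is.
std::string modified_utf8_to_utf8(const std::string& in) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const std::size_t n = in.size();
    std::string out;
    std::size_t i = 0;
    while (i < n) {
        // C0 80: the two-byte form of U+0000.
        if (i + 1 < n && p[i] == 0xC0 && p[i + 1] == 0x80) {
            out.push_back('\0');
            i += 2;
            continue;
        }
        // ED A0..AF xx (high surrogate) followed by ED B0..BF xx (low surrogate).
        if (i + 5 < n &&
            p[i] == 0xED && (p[i + 1] & 0xF0) == 0xA0 && (p[i + 2] & 0xC0) == 0x80 &&
            p[i + 3] == 0xED && (p[i + 4] & 0xF0) == 0xB0 && (p[i + 5] & 0xC0) == 0x80) {
            const unsigned hi = decode3(p + i);
            const unsigned lo = decode3(p + i + 3);
            const unsigned long cp = 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            i += 6;
            continue;
        }
        out.push_back(in[i]);
        ++i;
    }
    return out;
}
```

Data written through the ordinary input and output streams should not need this step, since according to the article those already use standard UTF-8.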
  

Ok, thanks, so porting legacy C/C++ to Unicode UTF-8 is simple :-)

Marcel




Re: c++ strings and UTF-8 (other charsets)

2007-02-28 Thread Marcel Ruff

Daniel B. wrote:

Marcel Ruff wrote:
  
...
  

As UTF-8 may not contain '\0' ...



Yes it can.

Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
  

Oi oi oi, this complicates things again.

1. Could serializing UTF-8 in Java over a socket and reading it in C/C++ as
UTF-8 cause problems?

  - Is there a conversion utility between Java's UTF-8 and standard UTF-8?

2. Using C UTF-8: When/how can it happen that a char* contains a '\0'
which is a character instead of the end of the char*?

Thanks for some enlightenment,

Marcel


Daniel
  






Re: c++ strings and UTF-8 (other charsets)

2007-02-27 Thread Marcel Ruff

Rich Felker wrote:

On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
  

On Mon, Feb 26, 2007 at 08:10:59AM +0100,
 Marcel Ruff [EMAIL PROTECTED] wrote 
 a message of 65 lines which said:




As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.).
  

As long as you just store or retrieve strings. If you compare them
(strcmp), you HAVE TO take normalization into account.



No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word process) will treat them as being distinct.

If your particular application has a special need for normalization,
then yes you need to take it into account. But if you're doing
something like passing around filenames you most surely should not be
normalizing anything.
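
A short illustration of what is at stake, as a sketch only. The two byte sequences below are canonically equivalent spellings of "é"; `precomposed`, `decomposed` and `same_bytes` are names invented for this example.

```cpp
#include <cstring>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, as a single precomposed character.
static const char precomposed[] = "\xC3\xA9";

// U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
static const char decomposed[] = "e\xCC\x81";

// Byte-wise comparison, which is all strcmp() does.
inline bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```

`same_bytes(precomposed, decomposed)` is false even though both render as the same letter. Whether that matters depends on the application: for filenames the byte-wise answer is usually the one wanted, while a user-facing search may want to normalize first with a Unicode library.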

  

If you measure
them (strlen), you HAVE TO use a character semantic, not a byte
semantic. And so on.



Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters..
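
The difference between the two lengths, and a byte limit that does not cut a character in half, can be sketched as follows. Both functions are hypothetical helpers that assume their input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Truncates to at most max_bytes bytes without splitting a character.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // If the first dropped byte is a continuation byte, the cut would land
    // inside a character, so back up to that character's lead byte.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

For the string "aℝ", `size()` is 4 and `count_code_points` gives 2. Truncating it to 3 bytes yields just "a", because the three-byte ℝ does not fit completely.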

  

Old code doesn't need to be ported.
  

Very strange advice, indeed.



?? Hardly strange.. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ.

But Marcel is right about a lot of old code (just not all). Most code
doesn't care at all about the contents of the text, just that it's a
string.
  

Thanks for all those details.

I can only say that when I started to port a C and a C++ library to
support Unicode on Linux/Unix/Windows/WindowsCE, I was totally lost with
the heaps of complicated and confusing advice found on the Internet (the
reason why I joined this mailing list).


But in the end everything was very simple:

1. UTF-8 does not contain zero bytes
2. Doing everything in UTF-8 and keeping my std::string and char* was a very
simple solution
3. I would need to define my own data types if I wanted to support UTF-16
(similar to Xerces and all the others). This would be a major effort.
4. Take care when passing the strings to other libraries / GUIs, as
mentioned in my first post
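
As an illustration of points 1 and 2, byte-oriented code such as the sketch below keeps working on UTF-8 data. In UTF-8 every byte of a multibyte character is 0x80 or above, so searching for an ASCII byte like '/' can never match inside another character. `base_name` and `contains` are hypothetical helpers written for this example.

```cpp
#include <cstring>
#include <string>

// Returns the part of a path after the last '/', using plain strrchr().
std::string base_name(const char* path) {
    const char* slash = std::strrchr(path, '/');
    return slash ? std::string(slash + 1) : std::string(path);
}

// Byte-wise substring search. For well-formed UTF-8 in both arguments a
// match cannot start in the middle of a character.
bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Neither function knows anything about Unicode, which is the reason most old code needed no changes.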


Getting to the above *simple* insight took me several confused days;
after that, the porting effort was done in one day.

I just wanted to share this to save others all the confusion,

Marcel


Rich



  






Re: c++ strings and UTF-8 (other charsets)

2007-02-25 Thread Marcel Ruff

Rich Felker wrote:

On Sat, Feb 24, 2007 at 06:13:37PM +0100, Julien Claassen wrote:
  

Hi!
  What I meant about UTF-8 strings in C++: I mean in C and C++ they're not
standard like in Java.



UTF-16, used by Java, is also variable-width. It can be either 2 bytes
or 4 bytes per character. Support for the characters that use 4 bytes
is generally very poor due to the misconception that it's
fixed-width.. :(

  
I think UTF-8 is a variable-width multibyte charset, so
there are specific problems in handling them and allocating the right space. I
mean, Glib contains something like UString and Qt has its QStrings, which
I think are also UTF-8 capable.


As far as I know:

Using UTF-8 in C or C++ is very simple:
As UTF-8 may not contain '\0' you can simply use all
functions as before (strcmp(), std::string etc.).
Old code doesn't need to be ported.

The only place to take care is when interfacing with other libraries
that use wchar_t and such (UTF-16, UTF-32); here
you need to convert using functions like wcstombs(), mbstowcs(),
mbrtowc() and such.

This works well on Linux, Windows or other OS,
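
A hedged sketch of such a conversion with the standard `std::mbstowcs`. The wrapper `multibyte_to_wide` is a hypothetical name. The function interprets the input according to the current LC_CTYPE locale, so the bytes are only read as UTF-8 if the program has selected a UTF-8 locale beforehand (for example with `setlocale(LC_CTYPE, "")` on a system configured for UTF-8). Locale names and the width of `wchar_t` differ between platforms, so this should be checked on each target.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical wrapper: converts a multibyte string (in the encoding of the
// current LC_CTYPE locale) to a wide string. Returns false if the input
// contains a byte sequence that is invalid in that encoding.
bool multibyte_to_wide(const std::string& in, std::wstring& out) {
    if (in.empty()) {
        out.clear();
        return true;
    }
    // Each wide character consumes at least one input byte, so a buffer
    // with one element per input byte is always large enough.
    std::wstring buf(in.size(), L'\0');
    const std::size_t n = std::mbstowcs(&buf[0], in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;
    buf.resize(n);
    out = buf;
    return true;
}
```

In the default "C" locale only ASCII input is certain to convert. Because the wrapper goes through `c_str()`, conversion also stops at the first zero byte.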

Marcel


All strings are UTF-8 capable; the unit of data is simply bytes
instead of characters. If you're looking for a class that treats
strings as a sequence of abstract characters rather than a sequence of
bytes, you could look for a library to do this or write your own.
However, I suspect the most useful way to do this in C++ would be to
extend whatever standard byte-based string class you're using with a
derived class.
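
One simple way to get a character-level view without a new class is to decode the bytes into code points on demand. The sketch below uses a free function rather than the derived class suggested above; `to_code_points` is a hypothetical name. It does not verify continuation bytes, and it emits U+FFFD for a bad lead byte or a truncated sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into code points.
std::u32string to_code_points(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length from the lead byte; 0 marks an invalid lead byte.
        const std::size_t len = b < 0x80 ? 1
                              : (b & 0xE0) == 0xC0 ? 2
                              : (b & 0xF0) == 0xE0 ? 3
                              : (b & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || s.size() - i < len) {
            out.push_back(0xFFFD);  // replacement character
            ++i;
            continue;
        }
        // Payload bits of the lead byte: 7, 5, 4 or 3 bits depending on len.
        char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The result can be indexed by character position, at the cost of four bytes per code point. Note that a code point is still not always what a user perceives as one character, since combining marks are separate code points.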

Maybe there's something like this built in to the C++ STL classes
already that I'm not aware of. As I said I don't know much of (modern)
C++. Can someone who knows the language better provide an answer?

It would also be easier to provide you answers if we knew better what
you're trying to do with the strings, i.e. whether you just need to
store them and spit them back in output, or whether you need to do
higher-level Unicode processing like line breaks, collation,
rendering, etc.

Rich



  


