RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: You are trying to stick with processing byte sequences, carefully preserving the storage format instead of preserving the meaning in terms of Unicode characters. This leads to less robust software which is not certain

RE: Nicest UTF

2004-12-13 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: My my, you are assuming all files are in the same encoding. Yes. Otherwise nothing shows filenames correctly to the user. UNIX is a multi user system. One user can use one locale and might never see files from another user that uses

RE: Nicest UTF

2004-12-13 Thread Lars Kristan
D. Starner wrote: Lars Kristan writes: A system administrator (because he has access to all files). My my, you are assuming all files are in the same encoding. And what about all the references to the files in scripts? In configuration files? Soft links?

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: But, as I once already said, you can do it with UTF-8, you simply keep the invalid sequences as they are, and really handle them differently only when you actually process them or display them. UTF-8 is painful to process in the first place. You are

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe Verdy wrote: An implementation that uses UTF-8 for valid string could use the invalid ranges for lead bytes to encapsulate invalid byte values. Note however that invalid bytes you would need to represent have 256 possible values, but

RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe VERDY wrote: If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding. So there's no need for Unicode to assign valid code points for invalid source data. Using

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread Otto Stolz
Mark Davis schrieb: This is just a confusion among the hoi polloi. And here we have yet another example: hoi is Greek for the (hoi polloi = the many). Best wishes, Otto Stolz

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: And once we understand that things are manageable and not as frightening as it seems at first, then we can stop using this as an argument against introducing 128 codepoints. People who will find them useful should and will bother with the consequences.

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread Mark Davis
Thanks. I have gotten several messages from people who didn't get the joke; that the sentence itself was an example of just the sorts of redundancy being discussed. Mark - Original Message - From: Otto Stolz [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent:

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread John H. Jenkins
On Dec 10, 2004, at 1:25 PM, Tim Greenwood wrote: Is that like the 'Please RSVP' that I see all too often? Or should that not be excused? Or -- my own personal favorite -- in the year AD 2004.

RE: Nicest UTF

2004-12-13 Thread D. Starner
Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess. --

Re: Thanks: auto loading Hebrew and Russian fonts ; Re: Unicode HTML, download

2004-12-13 Thread Peter Kirk
On 13/12/2004 14:57, Peter R. Mueller-Roemer wrote: ... Did you make a formal proposal for NBSP, as Elaine did for Samaritan ? The use of NBSP to carry combining marks is nothing to do with me. It has been in Unicode from the start, as an alternative to SPACE. Not long ago several problems were

Re: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status. You don't need to do that. No Unicode application must assign semantics to unassigned codepoints. If a source sequence is invalid, and you

RE: Thanks: auto loading Hebrew and Russian fonts ; Re: Unicode HTML, download

2004-12-13 Thread Jony Rosenne
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Peter R. Mueller-Roemer Sent: Monday, December 13, 2004 4:58 PM To: Peter Kirk Cc: Unicode Mailing List Subject: Re: Thanks: auto loading Hebrew and Russian fonts ; Re: Unicode HTML,

Re: Roundtripping in Unicode

2004-12-13 Thread Mark Davis
Ken is absolutely right. It would be theoretically possible to add 128 code points that would allow one to roundtrip a bytestream after passing through a UTF-8 = UTF-32 conversion. (For that matter, it would be possible to add 2048 code points that would allow the same for a 16-bit data stream.)
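As an illustration only, and not the proposal discussed in this thread: Python's `surrogateescape` error handler behaves much like the 128-code-point idea, except that it reuses the lone surrogates U+DC80..U+DCFF instead of newly assigned code points. A minimal sketch, in which the function names `bytes_to_text` and `text_to_bytes` are made up for the example:

```python
def bytes_to_text(raw: bytes) -> str:
    # Each byte that cannot be decoded as UTF-8 becomes U+DC00 + byte value,
    # so bytes 0x80..0xFF land on the 128 code points U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    # The escaped code points are turned back into the original single bytes.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"abc\xff\xfe"          # two bytes that can never occur in valid UTF-8
text = bytes_to_text(raw)
restored = text_to_bytes(text)
```

The resulting string is not valid Unicode text in the strict sense (it contains unpaired surrogates), which is one of the trade-offs the thread argues about.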

FW: Subj: Displaying Chinese characters and Chu Nom characters

2004-12-13 Thread Magda Danish (Unicode)
-Original Message- Date/Time:Sun Dec 12 18:57:05 CST 2004 Contact: [EMAIL PROTECTED] Report Type: Other Question, Problem, or Feedback Opt Subject: Displaying Chinese characters and Chu Nom characters Dear Unicode, I am using Windows Xp pro, and have the Chinese simplified

RE: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe VERDY wrote: I don't think I miss the point. My suggested approach to perform roundtrip conversions between UTF's while keeping all invalid sequences as invalid (for the standard UTFs), is much less risky than converting them to

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Kenneth Whistler wrote: Lars Kristan stated: I said, the choice is yours. My proposal does not prevent you from doing it your way. You don't need to change anything and it will still work the way it worked before. OK? I just want 128 codepoints

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Ken is absolutely right. It would be theoretically possible to add 128 code points that would allow one to roundtrip a bytestream after passing through a UTF-8 = UTF-32 conversion. (For that matter, it would be possible to add 2048 code points

Re: Roundtripping in Unicode

2004-12-13 Thread Arcane Jill
If I have understood this correctly, filenames are not in a locale, they are absolute. Users, on the other hand, are in a locale, and users view filenames. The same filename can look different to two different users. To user A (whose locale is Latin-1), a filename might look valid; to user B
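A small sketch of the point about filenames being bytes while users interpret them through a locale. The filename is invented, and the second user's UTF-8 locale is an assumption made only for this example:

```python
def view_filename(raw: bytes, encoding: str):
    """Return the filename as a user of the given encoding would see it,
    or None if the bytes are not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


name = b"caf\xe9.txt"                      # the bytes stored on disk

seen_by_a = view_filename(name, "latin-1")  # user A: Latin-1 locale
seen_by_b = view_filename(name, "utf-8")    # user B: assumed UTF-8 locale
```

Here user A sees `café.txt`, while for user B the same bytes are not valid text at all, because 0xE9 starts a multi-byte UTF-8 sequence that is never completed.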

Validity and properties of U+FFFD (was RE: Roundtripping in Unicode)

2004-12-13 Thread Lars Kristan
Doug Ewell wrote: Philippe VERDY wrote: (In fact I also think that mapping invalid sequences to U+FFFD is also an error, because U+FFFD is valid, and the presence of the encoding error in the source is

Re: RE: Roundtripping in Unicode

2004-12-13 Thread John Cowan
Doug Ewell scripsit: When faced with [an] ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit... as an illegally terminated code unit sequence -- for example, by signaling an error, filtering the code unit out, or

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Peter Kirk wrote: Now no doubt many Unix filename handling utilities ignore the fact that some octets are invalid or uninterpretable in the locale, because they handle filenames as octet strings (with 0x00 and 0x2F having special

Re: When to validate?

2004-12-13 Thread Arcane Jill
I like that. Makes total sense. Thanks. Jill -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Antoine Leca Sent: 10 December 2004 17:38 To: Unicode Subject: Re: When to validate? As a result, your strings are likely to be some structures. Then, it is pretty

Re: Nicest UTF

2004-12-13 Thread John Cowan
Lars Kristan scripsit: I'm using ISO-8859-2. In fact you're lucky. Many ISO-8859-1 filenames display correctly in ISO-8859-2. Not all users are so lucky. It was a design point of ISO-8859-{1,2,3,4}, but not any other variants, that every character appears either at the same codepoint or not

Re: Thanks: auto loading Hebrew and Russian fonts ; Re: Unicode HTML, download

2004-12-13 Thread Peter R. Mueller-Roemer
If I send the html-message e.g. to the IVRIT-group, whose members work on all kinds of systems, I do not want them to have problems reading and printing my mail. Any remedy on the horizon? Use plain text e-mail, or HTML e-mail generated by your mail client (including Mozilla, Outlook or

RE: Thanks: auto loading Hebrew and Russian fonts ; Re: Unicode HTML, download

2004-12-13 Thread Peter Constable
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter R. Mueller-Roemer Did you make a formal proposal for NBSP, as Elaine did for Samaritan ? I have studied unicode's uniqueness-rules, bidi-algorithm and many others and feel almost

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: UTF-8 is painful to process in the first place. You are making it even harder by demanding that all functions which process UTF-8 do something sensible for bytes which don't form valid UTF-8. They even can't temporarily

Re: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
From : Lars Kristan Philippe VERDY wrote: If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding. So there's no need for Unicode to assign valid code points for invalid source data. Using invalid UTF-16

RE: Roundtripping in Unicode

2004-12-13 Thread Kenneth Whistler
Lars Kristan stated: I said, the choice is yours. My proposal does not prevent you from doing it your way. You don't need to change anything and it will still work the way it worked before. OK? I just want 128 codepoints so I can make my own choice. You have them: U+EE80..U+EEFF, which are
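One way the U+EE80..U+EEFF suggestion could be tried out privately is sketched below. These code points lie inside the Private Use Area (U+E000..U+F8FF). The handler name `pua_escape`, the helper names, and the choice to escape one byte at a time are assumptions of this sketch, not anything specified in the thread:

```python
import codecs

PUA_BASE = 0xEE00  # byte 0x80 -> U+EE80, ..., byte 0xFF -> U+EEFF


def _escape_to_pua(exc):
    # Called by the UTF-8 decoder at each undecodable position.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start]          # always >= 0x80 for UTF-8 errors
    return chr(PUA_BASE + bad), exc.start + 1


codecs.register_error("pua_escape", _escape_to_pua)


def decode_with_pua(raw: bytes) -> str:
    return raw.decode("utf-8", "pua_escape")


def encode_with_pua(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)    # restore the original raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"na\xefve \xe2\x82\xac"           # stray Latin-1 byte, then a valid euro sign
text = decode_with_pua(raw)
restored = encode_with_pua(text)
```

The round trip is only lossless if the input never contains genuine U+EE80..U+EEFF characters; a valid UTF-8 encoding of one of them would come back as a single raw byte.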

Re: Nicest UTF

2004-12-13 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED] Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess. When you

Re: RE: Roundtripping in Unicode

2004-12-13 Thread Doug Ewell
Philippe VERDY wrote: (In fact I also think that mapping invalid sequences to U+FFFD is also an error, because U+FFFD is valid, and the presence of the encoding error in the source is lost, and will not throw exceptions in further processings of the remapped text, unless the application
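The concern about U+FFFD can be seen in a few lines. This is a sketch using Python's built-in `replace` and `strict` error handlers; the sample byte strings and function names are invented for the example:

```python
def decode_lossy(raw: bytes) -> str:
    # Invalid bytes are replaced by U+FFFD; decoding never fails.
    return raw.decode("utf-8", errors="replace")


def decode_strict(raw: bytes):
    # Invalid input is reported instead of being silently repaired.
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


broken_1 = b"a\xffb"
broken_2 = b"a\xfeb"
genuine = "a\ufffdb".encode("utf-8")     # text that really contains U+FFFD

lossy_1 = decode_lossy(broken_1)
lossy_2 = decode_lossy(broken_2)
lossy_3 = decode_lossy(genuine)
```

All three inputs decode to the same string, so later processing cannot tell which ones were ill-formed, and re-encoding `lossy_1` does not give back `broken_1`. With `decode_strict` the two broken inputs are rejected instead.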

Re: Subj: Displaying Chinese characters and Chu Nom characters

2004-12-13 Thread Doug Ewell
Paul stabbedupp at yahoo dot com wrote: ,,the Unicode number assigned to this character is U+21a38, and when inputting this number into the Unicode look-up chart, the box is blank. Is there a way i can make characters up in this block of Unicode display in my browser, and if so, what would i

Re: Subj: Displaying Chinese characters and Chu Nom characters

2004-12-13 Thread James Kass
Paul wrote: ,,the Unicode number assigned to this character is U+21a38, and when inputting this number into the Unicode look-up chart, the box is blank. Is there a way i can make characters up in this block of Unicode display in my browser, and if so, what would i have to do?