Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: [...] This was later amended in an errata for XML 1.0 which now says that the list of code points whose use is *discouraged* (but explicitly *not* forbidden) for the Char production is now: [...] Ugh, it's a mess... IMHO Unicode is partially to blame,

RE: When to validate?

2004-12-11 Thread Lars Kristan
Antoine Leca wrote: As a result, your strings are likely to be some structures. Then, it is pretty easy to add some s_valid flag, and you are done. Is that a proven technique? I'd say not. The flag would only be valid for as long as the string is not changed. You

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Arcane Jill responded: Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Michael Everson
At 17:38 -0800 2004-12-10, Asmus Freytag wrote: Other examples of apparent redundancy, are Cakes - Keks (German), plural Kekse Baby - bebis (Swedish), plural bebissar and there are many more such examples. In Ireland sometime in the early nineties, the Allied Irish Bank became AIB Bank, the

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
John Cowan wrote: However, although they are *technically* octet sequences, they are *functionally* character strings. That's the issue. Nicely put! But UTC does not seem to care. The point I'm making is that *whatever* you do,

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Johannes Bergerhausen
On 11.12.2004 at 04:32, Clark Cox wrote: There are always the classics: ATM Machine and PIN Number Here in Germany, they say ASCII-Code. :-) Johannes

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: Lars responded: ... Whatever the solutions for representation of corrupt data bytes or uninterpreted data bytes on conversion to Unicode may be, that is irrelevant to the concerns on whether an

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: It's essential that any UTF-n can be translated to any other without loss of data. Because it allows to use an implementation of the given functionality which represents data in any form, not necessarily the form we have at hand, as long as correctness

infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Peter R. Mueller-Roemer
Philippe Verdy wrote: The repertoire of all possible combining characters sequences is already infinite in Unicode, as well as the number of default grapheme clusters they can represent. For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: Lars Kristan [EMAIL PROTECTED] writes: The other name for this is roundtripping. Currently, Unicode allows a roundtrip UTF-16=UTF-8=UTF-16. For any data. But there are several reasons why a UTF-8=UTF-16(32)=UTF-8
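For readers following the roundtripping argument, here is a minimal Python sketch of the two directions being contrasted. It is an illustration only, not the scheme proposed in the thread: the sample byte string `raw` and the test string `s` are made up, and the `surrogateescape` error handler is Python's own later mechanism (PEP 383), used here simply because it shows what a lossless bytes-to-string-to-bytes trip looks like.

```python
# Direction 1: well-formed text survives UTF-16 -> UTF-8 -> UTF-16.
s = "na\u00efve \U0001d11e"          # hypothetical sample, includes a non-BMP character
utf16 = s.encode("utf-16-le")
via_utf8 = utf16.decode("utf-16-le").encode("utf-8")
back = via_utf8.decode("utf-8").encode("utf-16-le")

# Direction 2: arbitrary bytes do not survive a strict UTF-8 decode.
raw = b"caf\xe9.txt"                 # hypothetical Latin-1 file name, not valid UTF-8
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False                # strict decoding rejects the lone 0xE9 byte

# Python's surrogateescape handler maps each undecodable byte to a lone
# surrogate (U+DC80..U+DCFF) so the original bytes can be restored later.
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")
```

Note that `text` here is not a well-formed Unicode string (it contains an unpaired surrogate), which is roughly the trade-off the thread is arguing about.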

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread John Cowan
Philippe Verdy scripsit: Didn't know that. Is this a very recent use? It's been used as an English verb, adjective, and noun for 30-40 years and perhaps much longer: see below. In France, I think that RSVP was introduced and widely used at end of telegraphic messages (that contained lots of

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Peter Kirk
On 11/12/2004 02:29, Mark Davis wrote: This is just a confusion among the hoi polloi. Mark But such things happen not just among the German and Swedish polloi, but even in the crowning heights of the English language. The word cherubims is used many times in the King James Bible and at least

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: Further, as it turns out that Lars is actually asking for standardizing corrupt UTF-8, a notion that isn't going to fly even two feet, I think the whole idea is going to be a complete non-starter.

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Doug Ewell
Michael Everson everson at evertype dot com wrote: In Ireland sometime in the early nineties, the Allied Irish Bank became AIB Bank, the Allied Irish Bank Bank. Israel Discount Bank of New York regularly refers to itself as IDB Bank. -Doug Ewell Fullerton, California

RE: Software support costs (was: Nicest UTF)

2004-12-11 Thread Carl W. Brown
Philippe, However, within the program itself UTF-8 presents a problem when looking for specific data in memory buffers. It is nasty, time consuming and error prone. Mapping UTF-16 to code points is a snap as long as you do not have a lot of surrogates. If you do then probably UTF-32 should be
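As a rough illustration of the "mapping UTF-16 to code points" step mentioned above, the following Python sketch walks a list of 16-bit code units and combines surrogate pairs. The function name `utf16_units_to_code_points` is hypothetical, and passing unpaired surrogates through unchanged is an assumption of this sketch, not something the message specifies.

```python
def utf16_units_to_code_points(units):
    """Convert a sequence of UTF-16 code units (ints) to code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Surrogate pair: 10 bits from each unit, offset by 0x10000.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is
            # (an assumption of this sketch).
            code_points.append(unit)
            i += 1
    return code_points


# 'A' followed by U+1D11E (MUSICAL SYMBOL G CLEF) as a surrogate pair.
result = utf16_units_to_code_points([0x0041, 0xD834, 0xDD1E])
```

Without surrogates the loop degenerates to a one-to-one copy, which is the "snap" case; the extra branch is the only cost when supplementary characters appear.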

Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Curtis Clark
on 2004-12-11 09:21 John Cowan wrote: It's been used as an English verb, adjective, and noun for 30-40 years and perhaps much longer: see below. Longer. I can attest from my youth in the 1950s that my parents considered it ordinary English usage, and in fact knew of its origin. -- Curtis Clark

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: Lars Kristan [EMAIL PROTECTED] writes: All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Philippe Verdy wrote: This is a known caveat even for Unix, when you look at the tricky details of the support of Windows file sharing through Samba, when the client requests a file with a short 8.3 name, that a partition used

Re: Roundtripping in Unicode

2004-12-11 Thread Doug Ewell
Lars Kristan wrote: All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Séamas Ó Brógáin
John wrote: As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase The favor of your reply is requested. This is correct. The practice dates from the end of the nineteenth

RE: Nicest UTF

2004-12-11 Thread Lars Kristan
Missed this one the other day, but cannot let it go... Marcin 'Qrczak' Kowalczyk wrote: filenames, what is one supposed to do? Convert all filenames to UTF-8? Yes. Who will do that? A system administrator (because he has access to all files). My my,

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Lars Kristan wrote: I am sure one of the standardizers will find a Unicodally correct way of putting it. I can't even understand that paragraph, let alone paraphrase it. My understanding of his question and my response to his problem is that you MUST not use

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: Séamas Ó Brógáin [EMAIL PROTECTED] John wrote: As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase The favor of your reply is requested. This is correct. The

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Michael Everson
At 01:12 +0100 2004-12-12, Philippe Verdy wrote: I would not be surprised if this acronym was defined in some internationally accepted set of abbreviations used by telegraphists, so that their clients became exposed to these acronyms when reading telegrams received from their local post office

Re: infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: Peter R. Mueller-Roemer [EMAIL PROTECTED] For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite. I do think that you are underestimating the repertoire. Also Unicode does NOT define

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: Michael Everson [EMAIL PROTECTED] Nonsense. You might as well try to explain SPQR on the same basis. I won't. I know that SPQR was used on architectural constructions as a symbol of the Roman Empire, and it was a well-known acronym of a Latin expression. It largely predates the invention

Re: Nicest UTF

2004-12-11 Thread D. Starner
Marcin 'Qrczak' Kowalczyk writes: But demanding that each program which searches strings checks for combining classes is I'm afraid too much. How is it any different from a case-insensitive search? Does \n followed by a combining code point start a new line? The Standard says no,

RE: Nicest UTF

2004-12-11 Thread D. Starner
Lars Kristan writes: A system administrator (because he has access to all files). My my, you are assuming all files are in the same encoding. And what about all the references to the files in scripts? In configuration files? Soft links? If you want to break things, this is definitely the

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this problem of roundtripping is that if data, supposed to contain only valid UTF-8 sequences, contains some invalid byte sequences that still need to be roundtripped to some code point for internal management that can be roundtripped later to the

RE: When to validate?

2004-12-11 Thread Lars Kristan
Andy Heninger wrote: Some important things in designing a function API are o Fully define what the behavior is. With a function like tolower(), you could leave malformed sequences unaltered; you could replace them with some substitution character; you

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes: This implies that every programmer needs an in-depth knowledge of Unicode to handle simple strings. There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. This is probably true. I wonder whether

Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: Lars Kristan [EMAIL PROTECTED] writes: Quite close. Except for the fact that: * U+EE93 is represented in UTF-32 as 0xEE93 * U+EE93 is represented in UTF-16 as 0xEE93 * U+EE93 is
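The U+EE93 representations listed above can be checked with a few lines of Python. This is only a verification sketch using the standard codecs; the variable names are made up, and big-endian variants are chosen so the bytes read in the same order as the hex values in the message.

```python
cp = "\uee93"                      # private-use code point U+EE93

utf32 = cp.encode("utf-32-be")     # one 32-bit unit: 0x0000EE93
utf16 = cp.encode("utf-16-be")     # one 16-bit unit: 0xEE93 (BMP, no surrogates)
utf8 = cp.encode("utf-8")          # three bytes, since U+EE93 is in U+0800..U+FFFF

# All three forms decode back to the same single code point.
same = (
    utf32.decode("utf-32-be")
    == utf16.decode("utf-16-be")
    == utf8.decode("utf-8")
    == cp
)
```

Only the UTF-8 form differs in its byte values from the code point number, which is why the three forms have to be distinguished when discussing escape schemes built on private-use code points.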

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: The other name for this is roundtripping. Currently, Unicode allows a roundtrip UTF-16=UTF-8=UTF-16. For any data. But there are several reasons why a UTF-8=UTF-16(32)=UTF-8 roundtrip is more valuable, even if it means that the other roundtrip is no

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: Roundtrip for valid data is of course essential and needs to be preserved. Your proposal does not do this. All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data.

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject them.