RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Arcane Jill responded: Windows filesystems do know what encoding they use. Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a *nix application that displays filenames needs to know the encoding to use

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
John Cowan wrote: However, although they are *technically* octet sequences, they are *functionally* character strings. That's the issue. Nicely put! But UTC does not seem to care. The point I'm making is that *whatever* you do

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: Lars responded: ... Whatever the solutions for representation of corrupt data bytes or uninterpreted data bytes on conversion to Unicode may be, that is irrelevant to the concerns on whether

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: Further, as it turns out that Lars is actually asking for standardizing corrupt UTF-8, a notion that isn't going to fly even two feet, I think the whole idea is going to be a complete non-starter

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Philippe Verdy wrote: This is a known caveat even for Unix, when you look at the tricky details of the support of Windows file sharing through Samba, when the client requests a file with a short 8.3 name, that a partition used

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Antoine Leca
On Monday, December 6th, 2004 20:52Z John Cowan wrote: Doug Ewell scripsit: Now suppose you have a UNIX filesystem, containing filenames in a legacy encoding (possibly even more than one). If one wants to switch to UTF-8 filenames, what is one supposed to do? Convert all filenames to

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Arcane Jill
Antoine Leca wrote (09 December 2004 11:29, to the Unicode Mailing List): Windows filesystems do know what encoding they use. Err, not really. MS-DOS *needs

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Philippe Verdy
Antoine Leca wrote: Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a *nix application that displays filenames needs to know the encoding to use the correct set of glyphs (but constraints are much heavier.) Also Windows NT Unicode applications know it,

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Doug Ewell wrote: How do file names work when the user changes from one SBCS to another (let's ignore UTF-8 for now) where the interpretation is different? For example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102
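The reinterpretation Doug Ewell asks about can be shown directly. This is only a minimal sketch in Python, and it assumes the second charset is ISO 8859-2: the snippet is cut off before naming it, and ISO 8859-2 is one charset in which byte C3 maps to U+0102.

```python
# One stored filename byte, two readings depending on the charset in effect.
raw = b"\xc3"

as_latin1 = raw.decode("iso-8859-1")  # U+00C3, A with tilde
as_latin2 = raw.decode("iso-8859-2")  # U+0102, A with breve (assumed second charset)

# The bytes on disk never changed; only the interpretation did.
print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_latin2):04X}")
```

The file name itself is untouched on disk; what the user sees depends entirely on the decoder chosen at display time.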

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Needless to say, these systems were badly designed at their origin, and newer filesystems (and OS APIs) offer a much better alternative, by either explicitly storing on each volume which encoding it uses, or by forcing all user

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Kenneth Whistler wrote: I'm going to step in here, because this argument seems to be generating more heat than light. I agree, and I thank you for that. First, I'm going to summarize what I think Lars Kristan is suggesting

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Kenneth Whistler
John Cowan responded: Storage of UNIX filenames on Windows databases, for example, O.k., I just quoted this back from the original email, but it really is a complete misconception of the issue for databases. Windows databases is a

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Kenneth Whistler
Lars responded: ... Whatever the solutions for representation of corrupt data bytes or uninterpreted data bytes on conversion to Unicode may be, that is irrelevant to the concerns on whether an application is using UTF-8 or UTF-16 or UTF-32. The important fact is that if you have an

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell wrote: John Cowan <jcowan at reutershealth dot com> wrote: Windows filesystems do know what encoding they use. But a filename on a Unix(oid) file system is a mere sequence of octets, of which only 00 and 2F

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Philippe Verdy
I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Kenneth Whistler
Lars, I'm going to step in here, because this argument seems to be generating more heat than light. I never said it doesn't violate any existing rules. Stating that it does, doesn't help a bit. Rules can be changed. I ask you to step back and try to see the big picture. First, I'm going to

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread John Cowan
Kenneth Whistler scripsit: Storage of UNIX filenames on Windows databases, for example, can be done with BINARY fields, which correctly capture the identity of them as what they are: an unconvertible array of byte values, not a convertible string in some particular code page. This solution,

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote: An alternative can then be a mixed encoding selection: - choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252). - encode the filename with

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Kenneth Whistler <kenw at sybase dot com> wrote: I do not think this is a proposal to amend UTF-8 to allow invalid sequences. So we should get that off the table. I hope you are right. Apparently Lars is currently using PUA U+E080..U+E0FF (or U+EE80..U+EEFF ?) for this purpose, enabling the
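The idea of escaping undecodable bytes into a private-use range can be sketched as follows. This is an illustration only, not Lars Kristan's implementation: the base `PUA_BASE = 0xEE00` simply picks the second range mentioned above, and the names `pua_escape`, `decode_lossless`, and `encode_lossless` are hypothetical.

```python
import codecs

PUA_BASE = 0xEE00  # assumption: byte 0xNN is escaped to U+EENN (range U+EE80..U+EEFF)

def _escape_to_pua(err):
    """Error handler: replace each undecodable byte with one PUA code point."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end

codecs.register_error("pua_escape", _escape_to_pua)

def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8, keeping invalid bytes as PUA escapes."""
    return raw.decode("utf-8", errors="pua_escape")

def encode_lossless(text: str) -> bytes:
    """Inverse mapping: PUA escapes become the original single bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

legacy_name = b"caf\xe9"                 # Latin-1 bytes, not valid UTF-8
escaped = decode_lossless(legacy_name)   # 'caf' followed by U+EEE9
restored = encode_lossless(escaped)      # original bytes again
```

One consequence worth noting: text that legitimately contains U+EE80..U+EEFF, stored as valid UTF-8, would not survive `encode_lossless` unchanged, so the mapping is only reversible when those code points are reserved for this use.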

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Lars Kristan wrote: I never said it doesn't violate any existing rules. Stating that it does, doesn't help a bit. Rules can be changed. Assuming we understand the consequences. And that is what we should be discussing. By stating what should

Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread Doug Ewell
Lars Kristan wrote: I could not disagree more with the basic premise of Lars' post. It is a fundamental and critical mistake to try to extend Unicode with non-standard code unit sequences to handle data that cannot be, or has not been, converted to Unicode from a legacy

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread John Cowan
Doug Ewell scripsit: Now suppose you have a UNIX filesystem, containing filenames in a legacy encoding (possibly even more than one). If one wants to switch to UTF-8 filenames, what is one supposed to do? Convert all filenames to UTF-8? Well, yes. Doesn't the file system dictate what

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread Doug Ewell
John Cowan <jcowan at reutershealth dot com> wrote: Windows filesystems do know what encoding they use. But a filename on a Unix(oid) file system is a mere sequence of octets, of which only 00 and 2F are interpreted. (Filenames containing 20, and especially 0A, are annoying to handle with
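John Cowan's point about octet sequences can be illustrated with a short Python sketch. The helper `is_valid_component` is a hypothetical name, and the `surrogateescape` error handler is shown only as one existing way to carry arbitrary filename bytes through a Unicode string; it is not something proposed in this thread.

```python
def is_valid_component(name: bytes) -> bool:
    """A single path component may hold any octets except NUL (00) and '/' (2F)."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name

# Spaces (20) and newlines (0A) are legal, however awkward they are in shell scripts.
awkward_ok = is_valid_component(b"a\nb c")
slash_ok = is_valid_component(b"a/b")

# A name written under ISO 8859-1 is a legal filename but not valid UTF-8.
legacy = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'
legacy_ok = is_valid_component(legacy)

# Python's surrogateescape handler maps each undecodable byte to a lone
# surrogate (U+DC80..U+DCFF) so the original octets can be recovered later.
as_text = legacy.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")
```

Here `as_text` is not a well-formed Unicode string (it contains a lone surrogate), which is the same trade-off the thread is arguing about: lossless handling of the octets versus strict conformance of the resulting text.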