Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-27 Thread Markus Scherer
Doug Ewell wrote: Paragraph breaking implies that line breaking is also performed, and that the two are different somehow. LS and PS probably should not be treated as synonyms. Right, but we are talking about plain text editors here. I would expect a plain text editor to treat LS and PS

Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-26 Thread Markus Scherer
Doug Ewell wrote: SC UniPad can read and write text files: - using LF, CR, CRLF, or LS (U+2028); Great, and I know about UniPad, but more people have Windows Notepad and other system-level editors. Why does UniPad not support NL and PS? One thing it cannot do is maintain different line

Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-26 Thread Doug Ewell
Markus Scherer [EMAIL PROTECTED] wrote: Why does UniPad not support NL and PS? I don't work for Sharmahd, so the following is speculation. Despite what UAX #13 says, I don't know of any editor or other text tool that handles U+0085 as a newline character. The big debate has always been

CRLF vs. LF (was Re: Unicode and end users)

2002-02-21 Thread Lars Kristan
David Hopwood wrote: Lars Kristan wrote: Doug Ewell wrote: fine (as are LF-CRLF, stripped BOM's, and maybe even some edge cases like converting between tabs and spaces). If there are any security or spoofing concerns, it's best to leave everything completely untouched. I

Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-21 Thread Michael (michka) Kaplan
From: Lars Kristan [EMAIL PROTECTED] A - When writing, no CR characters will be written (unless read from a file). Many programs (like notepad) will not display such files correctly. It is a good question whether this is my problem or notepad's. Yours -- since you are feeding it files that

RE: CRLF vs. LF (was Re: Unicode and end users)

2002-02-21 Thread Murray Sargent
. Murray -Original Message- From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]] Sent: Thursday, February 21, 2002 8:40 AM To: Lars Kristan; 'David Hopwood'; [EMAIL PROTECTED] Subject: Re: CRLF vs. LF (was Re: Unicode and end users) Importance: Low From: Lars Kristan [EMAIL PROTECTED

RE: CRLF vs. LF (was Re: Unicode and end users)

2002-02-21 Thread Chris Pratley
Microsoft Word Sent with OfficeXP on WindowsXP -Original Message- From: Markus Scherer [mailto:[EMAIL PROTECTED]] Sent: February 21, 2002 2:29 PM To: unicode list Subject: Re: CRLF vs. LF (was Re: Unicode and end users) Murray Sargent wrote: be fairly easy to have an option to write

Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-21 Thread Doug Ewell
Markus Scherer [EMAIL PROTECTED] wrote: I think there is no doubt very high interest in editors - especially system default editors like notepad - that can both - read plain text using any style line breaks (see Unicode TR) - write plain text at least in LF or CRLF if not all the others too
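
The read-any-style, write-LF-or-CRLF behaviour asked for above could be sketched roughly as follows. This is a minimal illustration, not any editor's actual implementation; the function name `normalize_line_breaks` is hypothetical, and treating NEL, LS and PS all as ordinary line breaks is an assumption that a plain text editor might reasonably make.

```python
import re

# Line terminators discussed in the thread: CRLF, CR, LF, NEL (U+0085),
# LS (U+2028) and PS (U+2029). CRLF is listed first so that the pair is
# matched as one break rather than two.
_LINE_BREAK = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")


def normalize_line_breaks(text: str, newline: str = "\n") -> str:
    """Replace every recognized line terminator in `text` with `newline`.

    Hypothetical helper: it deliberately flattens LS and PS into the same
    break, which loses the line/paragraph distinction.
    """
    # A function is used as the replacement so that `newline` is inserted
    # literally, without regex escape processing.
    return _LINE_BREAK.sub(lambda _match: newline, text)
```

Calling `normalize_line_breaks(text, "\r\n")` would produce the CRLF form that Notepad expects, while the default produces LF.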

Re: Unicode and end users

2002-02-19 Thread David Hopwood
Lars Kristan wrote: John Cowan wrote: Frankly, your problem is insoluble, because you have set up self-contradictory requirements. Suppose you are dealing with a filesystem where some names are to be interpreted as Latin-1 and others as Latin-2. The

Re: Unicode and end users

2002-02-19 Thread David Hopwood
Lars Kristan wrote: Doug Ewell wrote: fine (as are LF-CRLF, stripped BOM's, and maybe even some edge cases like converting between tabs and spaces). If there are any security or spoofing concerns, it's best to leave everything completely untouched.

Re: Unicode and end users

2002-02-19 Thread John Cowan
David Hopwood scripsit: (I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does, at least on NT4.0, but you could reasonably treat that as an error.) NTFS filenames are UCS-2, not UTF-16, so ill-formed has no meaning. -- John Cowan http://www.ccil.org/~cowan

Re: Unicode and end users

2002-02-19 Thread David Hopwood
John Cowan wrote: David Hopwood scripsit: (I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does, at least on NT4.0, but you could reasonably treat that as an error.) NTFS filenames are UCS-2, not UTF-16, so ill-formed has no

Re: Unicode and end users - UTF-8B

2002-02-19 Thread Markus Scherer
Lars Kristan wrote: ... The same thing should work the other way around, store Windows filenames directly into a UTF-16 database and use UTF-8 = UTF-16 conversion for UNIX filenames. Hoping that some day most of the data will be UTF-8 makes this even more appealing. As for any data that is

RE: Unicode and end users

2002-02-19 Thread Chris Pratley
] Subject: Re: Unicode and end users David Hopwood scripsit: (I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does, at least on NT4.0, but you could reasonably treat that as an error.) NTFS filenames are UCS-2, not UTF-16, so ill-formed has no meaning. -- John Cowan

RE: Unicode and end users

2002-02-19 Thread Chris Pratley
Kristan; 'Asmus Freytag'; [EMAIL PROTECTED] Subject: Re: Unicode and end users Generally speaking, the best reader to do it all in is IE: you can open the text file, change the encoding, and then copy/paste it out into any other file. Not that Open as wouldn't be cool (it would save me some steps

RE: Unicode and end users

2002-02-18 Thread Lars Kristan
Doug Ewell wrote: fine (as are LF-CRLF, stripped BOM's, and maybe even some edge cases like converting between tabs and spaces). If there are any security or spoofing concerns, it's best to leave everything completely untouched. I see this as a good reason for NOT using BOM in UTF-8

RE: Unicode and end users

2002-02-18 Thread Lars Kristan
Why would anyone, faced with a UTF-8 file that contains invalid sequences, want to retain the invalid sequences, much less convert the file to another encoding form that either (a) preserves the invalid sequences or (b) leaves a marker showing where they were? Invalid sequences are garbage.

Re: Unicode and end users

2002-02-18 Thread John Cowan
Lars Kristan scripsit: I need to store UNIX filenames in a UTF-16 database residing on Windows. If I use ANSI-Unicode, there is no problem. However, if I have a filesystem with filenames mainly in UTF-8? Nobody can guarantee that all of them will be in UTF-8. Some may still be in ANSI (well
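
One way to carry filenames of mixed or unknown encoding through a Unicode string without losing bytes is Python's `surrogateescape` error handler, sketched below. This is offered only as an illustration of the round-tripping idea; it is not the scheme proposed in the thread, and the helper names `decode_name` and `encode_name` are hypothetical.

```python
def decode_name(raw: bytes) -> str:
    """Decode a filename as UTF-8, keeping undecodable bytes recoverable.

    Each invalid byte 0xNN is mapped to the lone surrogate U+DCNN instead
    of raising an error or being replaced by U+FFFD.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    """Reverse decode_name, restoring the original bytes exactly."""
    return name.encode("utf-8", errors="surrogateescape")
```

The resulting strings may contain lone surrogates, so they are not well-formed Unicode text; whether a given UTF-16 database would accept them is an open question that this sketch does not answer.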

RE: Unicode and end users

2002-02-18 Thread Lars Kristan
John Cowan wrote: Frankly, your problem is insoluble, because you have set up self-contradictory requirements. Suppose you are dealing with a filesystem where some names are to be interpreted as Latin-1 and others as Latin-2. The kernel will give you absolutely no help about which

Re: Unicode and end users

2002-02-18 Thread Doug Ewell
Lars Kristan [EMAIL PROTECTED] wrote: Most http servers have a functionality to display filesystem and allow changing directory and opening files. Hmmm, marking the generated html file as UTF-8 would be a no-no thing then, unless the server guarantees that there are no illegal sequences in

RE: Unicode and end users

2002-02-18 Thread Lars Kristan
Asmus Freytag wrote: Ever since MS let the cat out of the bag with notepad, the rush is on for all tools to be upgraded to handle the situation. Fine, this is the real world. *Sigh* yes, it is. I understand why notepad needs this. For notepad, a file is either UTF-16 or an ANSI file.

Re: Unicode and end users

2002-02-16 Thread David Hopwood
David Starner wrote: On Thu, Feb 14, 2002 at 03:15:24PM +, David Hopwood wrote: [re: a hypothetical charset that has almost all the properties of UTF-8] (The exception is that naïve substring searching could find a match starting part-way through a

Re: Unicode and end users

2002-02-16 Thread David Starner
On Fri, Feb 15, 2002 at 02:57:46PM +, David Hopwood wrote: Not having to add a few more lines of code to grep and sed is a good trade-off for a 50% penalty in encoding efficiency for Indic Southeast Asian scripts, Katakana, Hiragana and a few others? I don't think so. Not complicating

Re: Unicode and end users

2002-02-16 Thread Doug Ewell
David Hopwood [EMAIL PROTECTED] wrote: [I've thought about this a bit more, and I'm now convinced that it's useful to have a separate, standardised code for this - say U+FDEF ILL-FORMED INPUT MARKER. (Can noncharacters have names?) Nope. They're noncharacters. They do not exist; they never

Re: Unicode and end users

2002-02-16 Thread Asmus Freytag
At 12:37 PM 2/16/02 -0800, Doug Ewell wrote: Why would anyone, faced with a UTF-8 file that contains invalid sequences, want to retain the invalid sequences, much less convert the file to another encoding form that either (a) preserves the invalid sequences or (b) leaves a marker showing where

RE: Unicode and end users

2002-02-16 Thread Yves Arrouye
If foo is a US-ASCII string, grep foo file will work fine with any US-ASCII-superset charset for which non-ASCII characters do not use bytes < 0x80, including the hypothetical one I described, with no possibility of a false match. However grep fóó file will work only if the current shell

Re: Unicode and end users

2002-02-15 Thread Martin Kochanski
At 17:45 14/02/02 -0800, Asmus Freytag wrote: In principle this [not having a BOM] is a requirement for data being labelled *external to the data* as being in either UTF-16BE or UTF-16LE (ditto for UTF-32). These formats *must not* have a BOM. UTF-8 should *never* contain the BOM. Even

Re: Unicode and end users

2002-02-15 Thread David Hopwood
Lars Kristan wrote: Now, an opposite example. You execute ls > ls.out, in a directory that has some filenames (say, old files) in ISO and many others in UTF-8. What format is the resulting file in? Well, since this is happening in the year 2016, the editor

Re: Unicode and end users

2002-02-15 Thread David Hopwood
Doug Ewell wrote: Lars Kristan [EMAIL PROTECTED] wrote: This again makes me think that UTF-8 and UTF-16 are not both Unicode. No charset/CEF should be called Unicode; that would be ambiguous and inaccurate. Unicode is the name of a standard, a Coded

Re: Unicode and end users

2002-02-15 Thread David Hopwood
Keld Jørn Simonsen wrote: On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote: MK What we are trying to establish is the exact meaning that UNICODE MK ought to have - that is, if it can have one at all. In the Unix-like world, the term

RE: Unicode and end users

2002-02-15 Thread Rick Cameron
will explain why. Thanks! - rick cameron -Original Message- From: Tom Gewecke [mailto:[EMAIL PROTECTED]] Sent: Thursday, 14 February 2002 20:42 To: [EMAIL PROTECTED] Subject: RE: Unicode and end users Can you please expand on your statement that UTF-8 should never have a BOM? Having one

Re: Unicode and end users

2002-02-15 Thread David Starner
On Thu, Feb 14, 2002 at 03:15:24PM +, David Hopwood wrote: (The exception is that naïve substring searching could find a match starting part-way through a character - but it would be easy to reject false matches by looking at the previous byte.) But the fact that systems that can search

Re: Unicode and end users

2002-02-15 Thread David Starner
On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote: If there is a file on disc called foo.txt, it is clearly not typed data. Thus, it appears to be Mr Davis' opinion that when such a file contains UTF-8 data, it is quite appropriate for there to be a BOM at the start. In a global

RE: Unicode and end users

2002-02-15 Thread Rick Cameron
[mailto:[EMAIL PROTECTED]] Sent: Friday, 15 February 2002 11:24 To: Rick Cameron Cc: [EMAIL PROTECTED] Subject: Re: Unicode and end users On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote: If there is a file on disc called foo.txt, it is clearly not typed data. Thus, it appears

Re: Unicode and end users

2002-02-15 Thread Keld Jørn Simonsen
On Thu, Feb 14, 2002 at 03:15:57PM +, David Hopwood wrote: Keld Jørn Simonsen wrote: On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote: MK What we are trying to establish is the exact meaning that UNICODE MK ought to have - that

Unicode and end users

2002-02-14 Thread Martin Kochanski
First, let me thank everyone for their wise and experienced comments. This is exactly what this sort of list should be for... For the sake of clarity, let me define two terms: 1. Unicode means Unicode. 2. UNICODE means what an end user thinks when he sees the characters U, n, i, c, o, d, e on

RE: Unicode and end users

2002-02-14 Thread Lars Kristan
an 'equally good Unicode format'. And why do I keep this in the Unicode and end users thread? Because invalid sequences (and old filenames) are a fact that users WILL experience and pretending that this is just a case of non-conformance is not in the best interest of the users. Lars Kristan

Re: Unicode and end users

2002-02-14 Thread Juliusz Chroboczek
MK What we are trying to establish is the exact meaning that UNICODE MK ought to have - that is, if it can have one at all. In the Unix-like world, the term ``UTF-8'' has been used quite consistently, and most documentation avoids using Unicode for a disk format (using it for the consortium,

Re: Unicode and end users

2002-02-14 Thread Keld Jørn Simonsen
On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote: MK What we are trying to establish is the exact meaning that UNICODE MK ought to have - that is, if it can have one at all. In the Unix-like world, the term ``UTF-8'' has been used quite consistently, and most documentation

Re: Unicode and end users

2002-02-14 Thread Doug Ewell
Lars Kristan [EMAIL PROTECTED] wrote: AFAIK, UTF-8 files are NOT supposed to have a BOM in them. Different operating systems and applications have different preferences. There is no universal right or wrong about this. This is unfortunate, but true. Why is UTF-16 perceived as UNICODE? Well,

Re: Unicode and end users

2002-02-14 Thread David Starner
On Thu, Feb 14, 2002 at 05:46:46PM +0100, Keld Jørn Simonsen wrote: I would rather recommend that you write ISO 10646 UTF-8 as the ISO standard is a standard in many countries while Unicode is not. *Grumble*. The whole point of this discussion is making it clear for the users. Unicode is more

Re: Unicode and end users

2002-02-14 Thread Michael Everson
At 14:16 -0600 2002-02-14, David Starner wrote: The whole point of this discussion is making it clear for the users. Unicode is more clear for more users than ISO 10646 is. There is no reason to use ISO 10646, besides pedanticness. It is ISO/IEC 10646. -- Michael Everson *** Everson Typography

Re: Unicode and end users

2002-02-14 Thread Michael (michka) Kaplan
From: Michael Everson [EMAIL PROTECTED] At 14:16 -0600 2002-02-14, David Starner wrote: There is no reason to use ISO 10646, besides pedanticness. It is ISO/IEC 10646. The defense rests. MichKa Michael Kaplan Trigeminal Software, Inc. -- http://www.trigeminal.com/

Re: Unicode and end users

2002-02-14 Thread Asmus Freytag
At 09:22 AM 2/14/02 +, Martin Kochanski wrote: Are there, in fact, many circumstances in which it is necessary for an end user to create files that do *not* have a BOM at the beginning? In principle this is a requirement for data being labelled *external to the data* as being in either

RE: Unicode and end users

2002-02-14 Thread Rick Cameron
: Asmus Freytag [mailto:[EMAIL PROTECTED]] Sent: Thursday, 14 February 2002 17:46 To: Martin Kochanski; [EMAIL PROTECTED] Subject: Re: Unicode and end users At 09:22 AM 2/14/02 +, Martin Kochanski wrote: Are there, in fact, many circumstances in which it is necessary for an end user to create

RE: Unicode and end users

2002-02-14 Thread Yves Arrouye
UTF-8 should *never* contain the BOM. But as has been pointed out, it is common practice for Microsoft, and also for ICU's genrb tool, for example, which uses the BOM to autodetect the encoding. The more examples you see of that, the more people will use the BOM (now, can't we all use -*-

RE: Unicode and end users

2002-02-14 Thread Tom Gewecke
Can you please expand on your statement that UTF-8 should never have a BOM? Having one makes it very easy to distinguish a text file that contains UTF-8 from one that contains text in the system default MBCS encoding. You may not be surprised to learn that Microsoft (or, at least, one of its
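
The BOM check described here amounts to looking for the bytes EF BB BF at the start of the file. A minimal sketch follows; the function name `split_utf8_bom` is hypothetical, and the absence of a BOM says nothing about whether the data is UTF-8 or a legacy encoding, so a real tool would need a further heuristic for that case.

```python
import codecs
from typing import Tuple


def split_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Report whether `data` starts with the UTF-8 BOM and strip it if so.

    Returns (has_bom, payload). When no BOM is present the payload is the
    input unchanged and the encoding remains undetermined.
    """
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    return False, data
```

For reading whole files, Python's built-in `utf-8-sig` codec performs the same stripping during decoding.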