Doug Ewell wrote:
Paragraph breaking implies that line breaking is also performed, and that
the two are different somehow. LS and PS probably should not be treated
as synonyms.
Right, but we are talking about plain text editors here.
I would expect a plain text editor to treat LS and PS
Doug Ewell wrote:
SC UniPad can read and write text files:
- using LF, CR, CRLF, or LS (U+2028);
Great, and I know about UniPad, but more people have Windows Notepad and other
system-level editors.
Why does UniPad not support NL and PS?
One thing it cannot do is maintain different line
Markus Scherer [EMAIL PROTECTED] wrote:
Why does UniPad not support NL and PS?
I don't work for Sharmahd, so the following is speculation.
Despite what UAX #13 says, I don't know of any editor or other text tool
that handles U+0085 as a newline character. The big debate has always
been
David Hopwood wrote:
Lars Kristan wrote:
Doug Ewell wrote:
fine (as are LF-CRLF, stripped BOM's, and maybe even some edge cases
like converting between tabs and spaces). If there are any
security or spoofing concerns, it's best to leave everything completely
untouched.
I
From: Lars Kristan [EMAIL PROTECTED]
A - When writing, no CR characters will be written (unless read from a
file). Many programs (like notepad) will not display such files correctly.
It is a good question whether this is my problem or notepad's.
Yours -- since you are feeding it files that
.
Murray
-Original Message-
From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2002 8:40 AM
To: Lars Kristan; 'David Hopwood'; [EMAIL PROTECTED]
Subject: Re: CRLF vs. LF (was Re: Unicode and end users)
Importance: Low
From: Lars Kristan [EMAIL PROTECTED]
Microsoft Word
Sent with OfficeXP on WindowsXP
-Original Message-
From: Markus Scherer [mailto:[EMAIL PROTECTED]]
Sent: February 21, 2002 2:29 PM
To: unicode list
Subject: Re: CRLF vs. LF (was Re: Unicode and end users)
Murray Sargent wrote:
be fairly easy to have an option to write
Markus Scherer [EMAIL PROTECTED] wrote:
I think there is no doubt very high interest in editors - especially system
default editors like notepad - that can both
- read plain text using any style line breaks (see Unicode TR)
- write plain text at least in LF or CRLF if not all the others too
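A minimal sketch of what such an editor might do internally, assuming the set of line-break characters discussed in this thread (CR, LF, CRLF, NEL U+0085, LS U+2028, PS U+2029). The function names are hypothetical, and treating PS exactly like LS deliberately flattens the line/paragraph distinction raised earlier in the thread; a real editor might prefer to keep them apart.

```python
import re

# Line-break conventions mentioned in the thread.
# CRLF is listed first so that it is matched as one break, not two.
_ANY_BREAK = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")

def read_lines(text):
    """Split text on any of the recognised line-break styles."""
    return _ANY_BREAK.split(text)

def write_text(lines, newline="\n"):
    """Join lines with one caller-chosen convention, e.g. LF or CRLF."""
    return newline.join(lines)

# Example: mixed conventions in, CRLF out.
mixed = "one\r\ntwo\rthree\nfour\u2028five"
as_crlf = write_text(read_lines(mixed), newline="\r\n")
```

Note that this sketch cannot preserve different line-break styles within one file; every break is rewritten to the chosen output convention.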
Lars Kristan wrote:
John Cowan wrote:
Frankly, your problem is insoluble, because you have set up
self-contradictory requirements. Suppose you are dealing with a
filesystem where some names are to be interpreted as Latin-1 and others
as Latin-2. The
Lars Kristan wrote:
Doug Ewell wrote:
fine (as are LF-CRLF, stripped BOM's, and maybe even some edge cases
like converting between tabs and spaces). If there are any
security or spoofing concerns, it's best to leave everything completely
untouched.
David Hopwood scripsit:
(I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does,
at least on NT4.0, but you could reasonably treat that as an error.)
NTFS filenames are UCS-2, not UTF-16, so ill-formed has no meaning.
--
John Cowan http://www.ccil.org/~cowan
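To make the UCS-2 versus UTF-16 distinction concrete: read as UCS-2, every 16-bit unit stands alone, so there is nothing to be ill-formed; read as UTF-16, a surrogate unit is only valid as part of a high+low pair. The following is a rough sketch of the UTF-16 check, assuming the filename is available as a list of 16-bit integers (the function name is made up for illustration).

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit belongs to a high+low pair."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True
```

A program that chooses to treat unpaired surrogates in a filename as an error could apply such a check before converting the name to another encoding form.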
John Cowan wrote:
David Hopwood scripsit:
(I've just checked whether NTFS allows ill-formed UTF-16 filenames;
it does, at least on NT4.0, but you could reasonably treat that as
an error.)
NTFS filenames are UCS-2, not UTF-16, so ill-formed has no
Lars Kristan wrote:
...
The same thing should work the other way around, store Windows filenames
directly into a UTF-16 database and use UTF-8 <=> UTF-16 conversion for UNIX
filenames. Hoping that some day most of the data will be UTF-8 makes this
even more appealing. As for any data that is
Subject: Re: Unicode and end users
David Hopwood scripsit:
(I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does,
at least on NT4.0, but you could reasonably treat that as an error.)
NTFS filenames are UCS-2, not UTF-16, so ill-formed has no meaning.
--
John Cowan
Kristan; 'Asmus Freytag'; [EMAIL PROTECTED]
Subject: Re: Unicode and end users
Generally speaking, the best reader to do it all in is IE: you can open
the text file, change the encoding, and then copy/paste it out into any
other file.
Not that "Open as" wouldn't be cool (it would save me some steps
Doug Ewell wrote:
fine (as are LF-CRLF, stripped BOM's, and maybe even some edge cases
like converting between tabs and spaces). If there are any
security or spoofing concerns, it's best to leave everything completely
untouched.
I see this as a good reason for NOT using BOM in UTF-8
Why would anyone, faced with a UTF-8 file that contains invalid
sequences, want to retain the invalid sequences, much less convert the
file to another encoding form that either (a) preserves the invalid
sequences or (b) leaves a marker showing where they were? Invalid
sequences are garbage.
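For readers wondering what option (b), a marker showing where the invalid bytes were, can look like in practice: the sketch below uses Python's "surrogateescape" error handler, which maps each undecodable byte to a lone low surrogate so that the original bytes can be recovered exactly. This is a later technique chosen here for illustration, not something proposed by anyone in this thread, and the helper names are hypothetical.

```python
def decode_name(raw: bytes) -> str:
    """Decode as UTF-8; each invalid byte 0xNN becomes the lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_name(name: str) -> bytes:
    """Reverse decode_name, turning the escape surrogates back into raw bytes."""
    return name.encode("utf-8", errors="surrogateescape")

# A filename written by a Latin-1 application: not valid UTF-8.
latin1_name = "café".encode("latin-1")   # bytes 63 61 66 E9
escaped = decode_name(latin1_name)       # 'caf' followed by U+DCE9
restored = encode_name(escaped)          # the original four bytes again
```

The escaped string contains an unpaired surrogate, so it is not valid Unicode text and a strict encoder will refuse it; the marker only round-trips through code that knows about the convention.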
Lars Kristan scripsit:
I need to store UNIX filenames in a UTF-16 database residing on Windows. If
I use ANSI-Unicode, there is no problem. However, if I have a filesystem
with filenames mainly in UTF-8? Nobody can guarantee that all of them will
be in UTF-8. Some may still be in ANSI (well
John Cowan wrote:
Frankly, your problem is insoluble, because you have set up
self-contradictory requirements. Suppose you are dealing with a filesystem
where some names are to be interpreted as Latin-1 and others as Latin-2. The
kernel will give you absolutely no help about which
Lars Kristan [EMAIL PROTECTED] wrote:
Most http servers have a functionality to display filesystem and allow
changing directory and opening files. Hmmm, marking the generated html file
as UTF-8 would be a no-no thing then, unless the server guarantees that
there are no illegal sequences in
Asmus Freytag wrote:
Ever since MS let the cat out of the bag with notepad, the rush is on for
all tools to be upgraded to handle the situation. Fine, this is the real
world.
*Sigh* yes, it is. I understand why notepad needs this. For notepad, a file
is either UTF-16 or an ANSI file.
David Starner wrote:
On Thu, Feb 14, 2002 at 03:15:24PM +, David Hopwood wrote:
[re: a hypothetical charset that has almost all the properties of UTF-8]
(The exception is that naïve substring searching could find a
match starting part-way through a
On Fri, Feb 15, 2002 at 02:57:46PM +, David Hopwood wrote:
Not having to add a few more lines of code to grep and sed is a good
trade-off for a 50% penalty in encoding efficiency for Indic Southeast
Asian scripts, Katakana, Hiragana and a few others? I don't think so.
Not complicating
David Hopwood [EMAIL PROTECTED] wrote:
[I've thought about this a bit more, and I'm now convinced that it's
useful to have a separate, standardised code for this - say
U+FDEF ILL-FORMED INPUT MARKER. (Can noncharacters have names?)
Nope. They're noncharacters. They do not exist; they never
At 12:37 PM 2/16/02 -0800, Doug Ewell wrote:
Why would anyone, faced with a UTF-8 file that contains invalid
sequences, want to retain the invalid sequences, much less convert the
file to another encoding form that either (a) preserves the invalid
sequences or (b) leaves a marker showing where
If foo is a US-ASCII string, grep foo file will work fine with any
US-ASCII-superset charset for which non-ASCII characters do not use
bytes 0x80, including the hypothetical one I described, with no
possibility of a false match. However grep fóó file will work only
if the current shell
At 17:45 14/02/02 -0800, Asmus Freytag wrote:
In principle this [not having a BOM] is a requirement for data being labelled
*external to the data* as being in either UTF-16BE or UTF-16LE (ditto for
UTF-32). These formats *must not* have a BOM.
UTF-8 should *never* contain the BOM.
Even
Lars Kristan wrote:
Now, an opposite example. You execute ls > ls.out, in a directory that has
some filenames (say, old files) in ISO and many others in UTF-8. What format
is the resulting file in? Well, since this is happening in the year 2016,
the editor
Doug Ewell wrote:
Lars Kristan [EMAIL PROTECTED] wrote:
This again makes me think that UTF-8 and UTF-16 are not both Unicode.
No charset/CEF should be called Unicode; that would be ambiguous and
inaccurate. Unicode is the name of a standard, a Coded
Keld Jørn Simonsen wrote:
On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote:
MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that is, if it can have one at all.
In the Unix-like world, the term
will explain why.
Thanks!
- rick cameron
-Original Message-
From: Tom Gewecke [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 14 February 2002 20:42
To: [EMAIL PROTECTED]
Subject: RE: Unicode and end users
Can you please expand on your statement that UTF-8 should never have a
BOM? Having one
On Thu, Feb 14, 2002 at 03:15:24PM +, David Hopwood wrote:
(The exception is that naïve substring searching could find a
match starting part-way through a character - but it would be easy to
reject false matches by looking at the previous byte.)
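For comparison, here is a small sketch of how the same question plays out in UTF-8 itself (not in the hypothetical charset, whose byte layout is not given here). In UTF-8, continuation bytes always have the bit pattern 10xxxxxx and never start a character, so a byte-level search for a well-formed needle can only match at a character boundary. The helper names are invented for this example.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def find_all(haystack: bytes, needle: bytes):
    """Yield every byte offset at which needle occurs in haystack."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)

def matches_on_boundaries(haystack: bytes, needle: bytes) -> bool:
    """True if no match of needle starts in the middle of a character."""
    return all(not is_continuation(haystack[i]) for i in find_all(haystack, needle))

text = "naïve fóó foo".encode("utf-8")
ok_ascii = matches_on_boundaries(text, b"foo")
ok_accented = matches_on_boundaries(text, "fóó".encode("utf-8"))
# Searching for a bare continuation byte is not a well-formed needle,
# and it does match part-way through a character.
ok_fragment = matches_on_boundaries(text, b"\xb3")
```

In an encoding without this property, the equivalent of is_continuation would have to inspect the byte before the match, which is the extra step described above.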
But the fact that systems that can search
On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote:
If there is a file on disc called foo.txt, it is clearly not typed data.
Thus, it appears to be Mr Davis' opinion that when such a file contains
UTF-8 data, it is quite appropriate for there to be a BOM at the start.
In a global
[mailto:[EMAIL PROTECTED]]
Sent: Friday, 15 February 2002 11:24
To: Rick Cameron
Cc: [EMAIL PROTECTED]
Subject: Re: Unicode and end users
On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote:
If there is a file on disc called foo.txt, it is clearly not typed
data. Thus, it appears
On Thu, Feb 14, 2002 at 03:15:57PM +, David Hopwood wrote:
Keld Jørn Simonsen wrote:
On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote:
MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that
First, let me thank everyone for their wise and experienced comments. This is exactly
what this sort of list should be for...
For the sake of clarity, let me define two terms:
1. Unicode means Unicode.
2. UNICODE means what an end user thinks when he sees the characters U, n, i, c, o, d, e on an
'equally good Unicode format'.
And why do I keep this in the Unicode and end users thread? Because
invalid sequences (and old filenames) are a fact that users WILL experience
and pretending that this is just a case of non-conformance is not in the
best interest of the users.
Lars Kristan
MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that is, if it can have one at all.
In the Unix-like world, the term ``UTF-8'' has been used quite
consistently, and most documentation avoids using Unicode for a disk
format (using it for the consortium,
On Thu, Feb 14, 2002 at 03:57:34PM +, Juliusz Chroboczek wrote:
MK What we are trying to establish is the exact meaning that UNICODE
MK ought to have - that is, if it can have one at all.
In the Unix-like world, the term ``UTF-8'' has been used quite
consistently, and most documentation
Lars Kristan [EMAIL PROTECTED] wrote:
AFAIK, UTF-8 files are NOT supposed to have a BOM in them.
Different operating systems and applications have different preferences.
There is no universal right or wrong about this. This is
unfortunate, but true.
Why is UTF-16 perceived as UNICODE? Well,
On Thu, Feb 14, 2002 at 05:46:46PM +0100, Keld Jørn Simonsen wrote:
I would rather recommend that you write ISO 10646 UTF-8 as the
ISO standard is a standard in many countries while Unicode is not.
*Grumble*. The whole point of this discussion is making it clear for the
users. Unicode is more
At 14:16 -0600 2002-02-14, David Starner wrote:
The whole point of this discussion is making it clear for the
users. Unicode is more clear for more users than ISO 10646 is. There is
no reason to use ISO 10646, besides pedanticness.
It is ISO/IEC 10646.
--
Michael Everson *** Everson Typography
From: Michael Everson [EMAIL PROTECTED]
At 14:16 -0600 2002-02-14, David Starner wrote:
There is no reason to use ISO 10646, besides pedanticness.
It is ISO/IEC 10646.
The defense rests.
MichKa
Michael Kaplan
Trigeminal Software, Inc. -- http://www.trigeminal.com/
At 09:22 AM 2/14/02 +, Martin Kochanski wrote:
Are there, in fact, many circumstances in which it is necessary for an end
user to create files that do *not* have a BOM at the beginning?
In principle this is a requirement for data being labelled *external to the
data* as being in either
: Asmus Freytag [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 14 February 2002 17:46
To: Martin Kochanski; [EMAIL PROTECTED]
Subject: Re: Unicode and end users
At 09:22 AM 2/14/02 +, Martin Kochanski wrote:
Are there, in fact, many circumstances in which it is necessary for an end
user to create
UTF-8 should *never* contain the BOM.
But as has been pointed out, it is common practice for Microsoft, and also for
ICU's genrb tool, for example, which uses the BOM to autodetect the
encoding. The more examples you see of that, the more people will use the
BOM (now, can't we all use -*-
Can you please expand on your statement that UTF-8 should never have a BOM?
Having one makes it very easy to distinguish a text file that contains UTF-8
from one that contains text in the system default MBCS encoding.
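As a rough illustration of the kind of autodetection being described, the sketch below checks the first bytes of a file against the standard BOM byte sequences from Python's codecs module. The function name and return convention are assumptions made for this example; what to do when no BOM is found (fall back to the system default encoding, or guess) is exactly the policy question under debate and is left to the caller.

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Example: decode a UTF-8 file with a BOM, skipping the BOM bytes.
raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")
encoding, skip = sniff_bom(raw)
decoded = raw[skip:].decode(encoding) if encoding else None
```

Data that is labelled externally as UTF-16BE or UTF-16LE should not carry a BOM at all, so a reader that already knows the encoding from a label would not call a function like this.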
You may not be surprised to learn that Microsoft (or, at least, one of its