Hello Doug,
On 2012/07/18 0:35, Doug Ewell wrote:
For those who haven't yet had enough of this debate yet, here's a link
to an informative blog (with some informative comments) from Michael
Kaplan:
Every character has a story #4: U+feff (alternate title: UTF-8 is the
BOM, dude!)
Original Message
Date: Wed, 18 Jul 2012 13:45:59 +0200
From: Steven Atreju snatr...@googlemail.com
To: Doug Ewell d...@ewellic.org
Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML)
Doug Ewell wrote:
|For those who haven't yet had enough of this debate yet, here's a link
Steven Atreju snatreju at googlemail dot com wrote:
Funny that a program that cannot handle files larger than 0x7FFF
bytes (last time i've used it, 95B) has such a large impact.
Notepad hasn't had this limitation since Windows Me. That was many, many
years ago.
--
Doug Ewell | Thornton,
And those early versions of Notepad for 16/32-bit Windows were not
even Unicode compliant (the support for Unicode was minimalist, in
fact Unicode was only partly supported on top of the old ANSI/OEM
APIs; without support for the filesystem, and lots of quirks at the
kernel level caused by
Philippe Verdy verd...@wanadoo.fr wrote:
|2012/7/16 Steven Atreju snatr...@googlemail.com:
| Fifteen years ago i think i would have put effort in including the
| BOM after reading this, for complete correctness! I'm pretty sure
| that i really would have done so.
|
|Fifteen years ago I
For those who haven't yet had enough of this debate yet, here's a link
to an informative blog (with some informative comments) from Michael
Kaplan:
Every character has a story #4: U+feff (alternate title: UTF-8 is the
BOM, dude!)
http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx
On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote:
I am also convinced that even Shell interpreters on Linux/Unix should
recognize and accept the leading BOM before the hash/bang starting
line (which is commonly used for filetype identification and runtime
behavior), without claiming that
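[Editorial illustration, not part of the thread.] The hash/bang point can be made concrete with a small sketch. It assumes a loader that recognizes a script only when the two bytes "#!" sit at offset 0; the helper name starts_with_shebang is hypothetical and this is not any particular kernel's or shell's code.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Assumed check: the file is a script only if its first two bytes are '#!'."""
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"
# The same script as saved by an editor that prepends the UTF-8 signature.
script_with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))
print(starts_with_shebang(script_with_bom))
# A BOM-tolerant loader would have to skip the three signature bytes first.
print(starts_with_shebang(script_with_bom[len(codecs.BOM_UTF8):]))
```

With the signature in front, the first bytes are EF BB BF rather than "#!", so a check of this kind fails unless the loader is changed to skip them.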
2012/7/17 Julian Bradfield jcb+unic...@inf.ed.ac.uk:
On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote:
I am also convinced that even Shell interpreters on Linux/Unix should
recognize and accept the leading BOM before the hash/bang starting
line (which is commonly used for filetype
Hello Philippe,
On 2012/07/18 3:37, Philippe Verdy wrote:
2012/7/17 Julian Bradfield jcb+unic...@inf.ed.ac.uk:
On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote:
I am also convinced that even Shell interpreters on Linux/Unix should
recognize and accept the leading BOM before the hash/bang
Doug Ewell d...@ewellic.org wrote:
|Steven Atreju wrote:
|
| If Unicode *defines* that the so-called BOM is in fact a Unicode-
| indicating tag that MUST be present,
|
|But Unicode does not define that.
Nope. On http://unicode.org/faq/utf_bom.html i read:
Q: Why do some of the UTFs
Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200:
Doug Ewell d...@ewellic.org wrote:
And:
Q: Is the UTF-8 encoding scheme the same irrespective of whether
the underlying processor is little endian or big endian?
...
Where a BOM is used with UTF-8, it is only used as an encoding
Steven Atreju wrote:
Q: Is the UTF-8 encoding scheme the same irrespective of whether
the underlying processor is little endian or big endian?
...
Where a BOM is used with UTF-8, it is only used as an encoding
signature to distinguish UTF-8 from other encodings — it has
nothing
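[Editorial illustration, not part of the thread.] A minimal sketch of what "encoding signature" means in practice, using Python's standard codecs module. The helper strip_utf8_signature is a hypothetical name introduced here; the sample text is borrowed from elsewhere in the thread.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


without_bom = "Schöne Überraschung".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# The signature carries no byte-order information; it is the same three bytes
# on every machine and only marks the data as UTF-8.
print(strip_utf8_signature(with_bom) == without_bom)

# The plain "utf-8" codec keeps U+FEFF as the first character,
# while "utf-8-sig" drops it when decoding.
print(repr(with_bom.decode("utf-8")[0]))
print(with_bom.decode("utf-8-sig"))
```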
Steven Atreju wrote:
If Unicode *defines* that the so-called BOM is in fact a Unicode-
indicating tag that MUST be present,
But Unicode does not define that.
I know that, in Germany, many, many small libraries become closed
because there is not enough money available to keep up with the
Date: Fri, 13 Jul 2012 22:07:54 +0200
From: Steven Atreju snatr...@googlemail.com
Cc: unicode@unicode.org
this time without reply-in-same-charset and
encoding=8bit and i bet it comes out as UTF-8 on the other end:
Yes, it does.
Eli Zaretskii e...@gnu.org wrote:
| Date: Fri, 13 Jul 2012 22:07:54 +0200
| From: Steven Atreju snatr...@googlemail.com
| Cc: unicode@unicode.org
|
| this time without reply-in-same-charset and
| encoding=8bit and i bet it comes out as UTF-8 on the other end:
|
|Yes, it does.
..cheer..
Philippe Verdy verd...@wanadoo.fr wrote:
|2012/7/12 Steven Atreju snatr...@googlemail.com:
| UTF-8 is a bytestream, not multioctet(/multisequence).
|Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
|bytes. It has a lot of internal semantics and constraints. Some things
2012/7/13 Steven Atreju snatr...@googlemail.com:
Philippe Verdy verd...@wanadoo.fr wrote:
|2012/7/12 Steven Atreju snatr...@googlemail.com:
| UTF-8 is a bytestream, not multioctet(/multisequence).
|Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
|bytes. It has a lot
Date: Fri, 13 Jul 2012 16:04:44 +0200
From: Steven Atreju snatr...@googlemail.com
For example, this mail is
written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8
encoding («Schöne Überraschung, gelle?»
No, it isn't:
User-Agent: S-nail 12.5 7/5/10;s-nail-9-g517ac44-dirty
Eli Zaretskii e...@gnu.org wrote:
| For example, this mail is
| written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8
| encoding («Schöne Überraschung, gelle?»
|
|No, it isn't:
|
|Content-Type: text/plain; charset=ISO-8859-1
Oh, it's really terrible. I do have
Philippe Verdy verd...@wanadoo.fr wrote:
|2012/7/13 Steven Atreju snatr...@googlemail.com:
| Philippe Verdy verd...@wanadoo.fr wrote:
|
| |2012/7/12 Steven Atreju snatr...@googlemail.com:
| | UTF-8 is a bytestream, not multioctet(/multisequence).
| |Not even. UTF-8 is a text-stream, not
| As for editors: If your own editor has no problems with the BOM, then
| what? But I think Notepad can also save as UTF-8 but without the BOM -
| it should be possible to get an option for choosing when you save
| it.
|
|Perhaps there should be such an option in Notepad, but there
Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200:
In the meanwhile the UTF-8 BOM is in the standard and thus
contradicts forty years of (well) good (Unix/POSIX) engineering
and craftsmanship. Where a file is a file and everything is a
file, holistically. Where small tools which do their
On Thu, Jul 12, 2012 at 4:06 AM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:
I guess you get the same problem with UTF-16 files also, then?
UTF-16 isn't a text file in the Unix world; it's a binary file. UTF-8
is the only standard Unicode encoding that acts like text to a Unix
On 2012-07-12, Steven Atreju snatr...@googlemail.com wrote:
In the future simple things like '$ cat File1 File2 File3' will
no longer work that easily. Currently this works *whatever* file,
and even program code that has been written more than thirty years
ago will work correctly. No! You
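[Editorial illustration, not part of the thread.] The concatenation concern can be sketched as follows. The function concat_bytes is a hypothetical stand-in for what cat does (joining bytes without interpreting them), and the two "files" are in-memory byte strings assumed to have been saved with a UTF-8 signature.

```python
import codecs


def concat_bytes(*parts: bytes) -> bytes:
    """Join byte strings end to end, as a byte-oriented tool like cat would."""
    return b"".join(parts)


file1 = codecs.BOM_UTF8 + b"one\n"
file2 = codecs.BOM_UTF8 + b"two\n"

joined = concat_bytes(file1, file2)

# Decoding with "utf-8-sig" removes only the leading signature; the one that
# came from the second file survives as U+FEFF in the middle of the text.
text = joined.decode("utf-8-sig")
print(repr(text))
```

The bytes are still concatenated without loss; what changes is that a stray U+FEFF now sits at the former file boundary, which downstream tools may or may not ignore.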
Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote:
|Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200:
|
| In the meanwhile the UTF-8 BOM is in the standard and thus
| contradicts forty years of (well) good (Unix/POSIX) engineering
| and craftsmanship. Where a file is a file and
Right. Unix was unique when it was created as it was built to handle
all files as unstructured binary files. The history is a lot
different, and text files have always used another paradigm, based on
line records. End of lines initially were not really control
characters. And even today the
2012/7/12 Steven Atreju snatr...@googlemail.com:
UTF-8 is a bytestream, not multioctet(/multisequence).
Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
bytes. It has a lot of internal semantics and constraints. Some things
are very meaningful, some play absolutely no role at