Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Martin J. Dürst
Hello Doug, On 2012/07/18 0:35, Doug Ewell wrote: For those who haven't yet had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)

Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Steven Atreju
Original Message Date: Wed, 18 Jul 2012 13:45:59 +0200 From: Steven Atreju snatr...@googlemail.com To: Doug Ewell d...@ewellic.org Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML) Doug Ewell wrote: |For those who haven't yet had enough of this debate yet, here's a link

RE: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Doug Ewell
Steven Atreju snatreju at googlemail dot com wrote: Funny that a program that cannot handle files larger than 0x7FFF bytes (last time i've used it, 95B) has such a large impact. Notepad hasn't had this limitation since Windows Me. That was many, many years ago. -- Doug Ewell | Thornton,

Re: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Philippe Verdy
And those early versions of Notepad for 16/32-bit Windows were not even Unicode compliant (the support for Unicode was minimalist, in fact Unicode was only partly supported on top of the old ANSI/OEM APIs; without support for the filesystem, and lots of quirks at the kernel level caused by

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote: |2012/7/16 Steven Atreju snatr...@googlemail.com: | Fifteen years ago i think i would have put effort in including the | BOM after reading this, for complete correctness! I'm pretty sure | that i really would have done so. | |Fifteen years ago I

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Doug Ewell
For those who haven't yet had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Julian Bradfield
On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang starting line (which is commonly used for filetype identification and runtime behavior), without claiming that

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Philippe Verdy
2012/7/17 Julian Bradfield jcb+unic...@inf.ed.ac.uk: On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang starting line (which is commonly used for filetype

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Martin J. Dürst
Hello Philippe, On 2012/07/18 3:37, Philippe Verdy wrote: 2012/7/17 Julian Bradfield jcb+unic...@inf.ed.ac.uk: On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Steven Atreju
Doug Ewell d...@ewellic.org wrote: |Steven Atreju wrote: | | If Unicode *defines* that the so-called BOM is in fact a Unicode- | indicating tag that MUST be present, | |But Unicode does not define that. Nope. On http://unicode.org/faq/utf_bom.html i read: Q: Why do some of the UTFs

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Leif Halvard Silli
Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200: Doug Ewell d...@ewellic.org wrote: And: Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian? ... Where a BOM is used with UTF-8, it is only used as an encoding

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Doug Ewell
Steven Atreju wrote: Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian? ... Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-15 Thread Doug Ewell
Steven Atreju wrote: If Unicode *defines* that the so-called BOM is in fact a Unicode- indicating tag that MUST be present, But Unicode does not define that. I know that, in Germany, many, many small libraries become closed because there is not enough money available to keep up with the

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-14 Thread Eli Zaretskii
Date: Fri, 13 Jul 2012 22:07:54 +0200 From: Steven Atreju snatr...@googlemail.com Cc: unicode@unicode.org this time without reply-in-same-charset and encoding=8bit and i bet it comes out as UTF-8 on the other end: Yes, it does.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-14 Thread Steven Atreju
Eli Zaretskii e...@gnu.org wrote: | Date: Fri, 13 Jul 2012 22:07:54 +0200 | From: Steven Atreju snatr...@googlemail.com | Cc: unicode@unicode.org | | this time without reply-in-same-charset and | encoding=8bit and i bet it comes out as UTF-8 on the other end: | |Yes, it does. ..cheer..

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote: |2012/7/12 Steven Atreju snatr...@googlemail.com: | UTF-8 is a bytestream, not multioctet(/multisequence). |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of |bytes. It has a lot of internal semantics and constraints. Some things

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Philippe Verdy
2012/7/13 Steven Atreju snatr...@googlemail.com: Philippe Verdy verd...@wanadoo.fr wrote: |2012/7/12 Steven Atreju snatr...@googlemail.com: | UTF-8 is a bytestream, not multioctet(/multisequence). |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of |bytes. It has a lot

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Eli Zaretskii
Date: Fri, 13 Jul 2012 16:04:44 +0200 From: Steven Atreju snatr...@googlemail.com For example, this mail is written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8 encoding («Schöne Überraschung, gelle?» No, it isn't: User-Agent: S-nail 12.5 7/5/10;s-nail-9-g517ac44-dirty

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Eli Zaretskii e...@gnu.org wrote: | For example, this mail is | written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8 | encoding («Schöne Überraschung, gelle?» | |No, it isn't: | |Content-Type: text/plain; charset=ISO-8859-1 Oh, it's really terrible. I do have

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote: |2012/7/13 Steven Atreju snatr...@googlemail.com: | Philippe Verdy verd...@wanadoo.fr wrote: | | |2012/7/12 Steven Atreju snatr...@googlemail.com: | | UTF-8 is a bytestream, not multioctet(/multisequence). | |Not even. UTF-8 is a text-stream, not

UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
| As for editors: If your own editor has no problems with the BOM, then | what? But I think Notepad can also save as UTF-8 but without the BOM - | it should be possible to get an option for choosing when you save | it. | |Perhaps there should be such an option in Notepad, but there

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Leif Halvard Silli
Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200: In the meanwhile the UTF-8 BOM is in the standard and thus contradicts forty years of (well) good (Unix/POSIX) engineering and craftsmanship. Where a file is a file and everything is a file, holistically. Where small tools which do their

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread David Starner
On Thu, Jul 12, 2012 at 4:06 AM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: I guess you get the same problem with UTF-16 files also, then? UTF-16 isn't a text file in the Unix world; it's a binary file. UTF-8 is the only standard Unicode encoding that acts like text to a Unix

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Steven Atreju snatr...@googlemail.com wrote: In the future simple things like '$ cat File1 File2 File3' will no longer work that easily. Currently this works *whatever* file, and even program code that has been written more than thirty years ago will work correctly. No! You

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200: | | In the meanwhile the UTF-8 BOM is in the standard and thus | contradicts forty years of (well) good (Unix/POSIX) engineering | and craftsmanship. Where a file is a file and

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Philippe Verdy
Right. Unix was unique when it was created as it was built to handle all files as unstructured binary files. The history is a lot different, and text files have always used another paradigm, based on line records. End of lines initially were not really control characters. And even today the

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Philippe Verdy
2012/7/12 Steven Atreju snatr...@googlemail.com: UTF-8 is a bytestream, not multioctet(/multisequence). Not even. UTF-8 is a text-stream, not made of arbitrary sequences of bytes. It has a lot of internal semantics and constraints. Some things are very meaningful, some play absolutely no role at