Re: pre-HTML5 and the BOM
Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: On 2012/07/17 23:11, Leif Halvard Silli wrote: Martin J. Dürst, Tue, 17 Jul 2012 18:49:47 +0900: On 2012/07/17 17:22, Leif Halvard Silli wrote: that a page with strict ASCII characters inside could still contain character entities/references for characters outside ASCII. Of course they can. … snip … And the question was whether such a page should default to be seen as UTF-8 encoded. If I understand correctly, whether it's seen as UTF-8 encoded would be irrelevant when displaying the page, but might be relevant e.g. for form submission and the like? Yes. There might be technical problems too: HTML5 browsers are, when sniffing, asked to scan only a small beginning of the document. It might be a thin reason to default to UTF-8 just because the start of the document contained no non-ASCII. … one browser where it does hurt more directly: … W3M … renders &aring; and &#229; as an 'aa' instead of as an 'å' … In a followup mail, you write: To quote one W3m slogan: 'Its 8-bit support is second to none'. W3m is a quite modern text browser. It is regularly updated, it can be used with emacs, and is the text browser I would recommend. If W3M is updated so regularly, why isn't the &aring;/&#229; - 'aa' bug simply fixed? Fair point. I've made the W3m mailing list aware of it. So it seems to me that it is always advantageous to type characters directly, as doing so allows for better character encoding detection in case the encoding labels disappear (read: easier to pick up that the page is UTF-8 encoded) and also works better in at least one browser. It does, as well, make authors more aware of the entire encoding issue, since it means that the page has to be properly labeled in order to work across parsers. I agree that in general, characters should be encoded directly. There may be exceptions such as &nbsp;, where in some editing environments, it's very helpful to see them explicitly. 
But a bug in a minor (or even a major) browser shouldn't be the reason for avoiding character entities and numeric character references. Advising about how to code based on accidental bugs is of course hopeless. The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, are only about authoring convenience. And, of course, there is a difference between using named and numeric character references for a single non-ASCII letter as opposed to using them for all of them. Nevertheless: I, as Web author, would perhaps skip that convenience if I knew that doing so could improve e.g. HTML5 browsers' ability to sniff the encoding correctly when all other encoding info is lost. If such sniffing can be an alternative to the BOM, and the BOM is questionable, then why not mention it as a reason to avoid the crutches? -- Leif Halvard Silli
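[Editorial aside: Leif's point about sniffing can be illustrated with a small sketch. Once every non-ASCII letter is written as &aring; or &#229;, the byte stream is pure ASCII, so it is byte-identical under UTF-8, Latin-1, and most legacy encodings, and a byte-level detector has nothing left to work with. A minimal Python illustration, not any browser's actual algorithm:]

```python
# Why escaping all non-ASCII characters defeats byte-level encoding
# detection: the escaped form is pure ASCII, the direct form is not.

escaped = "<p>&aring; and &#229;</p>".encode("utf-8")
direct  = "<p>å and å</p>".encode("utf-8")

def has_non_ascii(data: bytes) -> bool:
    """True if any byte could betray the encoding to a sniffer."""
    return any(b > 0x7F for b in data)

print(has_non_ascii(escaped))  # False: nothing to sniff
print(has_non_ascii(direct))   # True: 0xC3 0xA5 is a telltale UTF-8 sequence
```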
Re: UTF-8 BOM (Re: Charset declaration in HTML)
Hello Doug, On 2012/07/18 0:35, Doug Ewell wrote: For those who haven't had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx What should be interesting is that this blog dates to January 2005, seven and a half years ago, and yet includes the following: But every 4-6 months another huge thread on the Unicode List gets started Well, more or less than 4-6 months, but yes. about how bad the BOM is for UTF-8 and how it breaks UNIX tools that have been around and able to support UTF-8 without change for decades Yes indeed. The BOM and Unix/Linux tools don't work well together. and about how Microsoft is evil for shipping Notepad that causes all of these problems That's a bit overblown, but I guess for a Microsoft employee, it looks like this. and how neither the W3C nor Unicode would have ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, That's true, too. It was indeed Notepad that brought the UTF-8 BOM/signature to the attention of the W3C and the browser makers. The problem with the BOM in UTF-8 is that it can be quite helpful (for quickly distinguishing between UTF-8 and legacy-encoded files) and quite damaging (for programs that use the Unix/Linux model of text processing), and that's why it creates so much controversy. Regards, Martin.
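[Editorial aside: Martin's "helpful and damaging" point is easy to demonstrate. The sketch below (plain Python, not tied to any particular tool) shows both sides: the three signature bytes make a UTF-8 file trivially recognizable, but they also break the byte-exact assumptions of Unix-style tools, such as the kernel's `#!` check at the very first byte of a script.]

```python
import codecs

data = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# Helpful: EF BB BF is a cheap, near-unambiguous UTF-8 marker.
print(data.startswith(codecs.BOM_UTF8))  # True

# Damaging: a byte-oriented tool sees the BOM as content, so the file no
# longer *begins* with "#!" and the kernel will not run it as a script.
print(data.startswith(b"#!"))  # False

# Python's "utf-8-sig" codec strips the signature; plain "utf-8" keeps it
# as a leading U+FEFF character.
print(data.decode("utf-8-sig").startswith("#!"))   # True
print(data.decode("utf-8").startswith("\ufeff"))   # True
```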
Re: pre-HTML5 and the BOM
On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, is only about authoring convenience. And, of course, it is a difference between using named and numeric character references for a single non-ASCII letter as opposed to using it for all of them. Nevertheless: I, as Web author, would perhaps skip that convenience if I knew that doing so could improve e.g. HTML5 browser's ability to sniff the encoding correctly when all other encoding info is lost. If such sniffing can be an alternative to the BOM, and the BOM is questionable, then why not mention it as a reason to avoid the crutches? I'm not sure there are many people for whom using named character entities or numeric character references is a convenience. But for those for whom it is a convenience, let them use it. Regards, Martin.
Re: pre-HTML5 and the BOM
Martin, Martin J. Dürst, Wed, 18 Jul 2012 10:05:40 +0900: On 2012/07/18 4:35, Leif Halvard Silli wrote: But is the Windows Notepad really to blame? Pretty much so. There may have been other products from Microsoft that also did it, but with respect to forcing browsers and XML parsers to accept a UTF-8 BOM as a signature, Notepad was definitely the main cause, by far. OK, it was leading the way. But can we think of something that could have worked better, in practice? And, no, I don't mean 'better' as in 'not leaking the BOM into HTML'. I mean 'better' as in 'spreading UTF-8 to the masses'. UTF-8 is easy and cheap to detect heuristically. It takes a bit more work to scan the whole file than to just look at the first few bytes, but then I don't think anybody is/was editing 1MB files in Notepad. So the BOM/signature is definitely not the reason that UTF-8 spread on the Web and elsewhere. (The file length issue is an issue on the Web too.) The spread of UTF-8 is due to its strict US-ASCII compatibility. Every US-ASCII character/byte represents the same character, and only that character, in UTF-8. A plain ASCII file is a UTF-8 file. If syntax-significant characters are ASCII, then (close to) nothing may need to change when moving from a legacy encoding to UTF-8. On top of that, character synchronization is very easy because leading bytes and trailing bytes have strictly separate values. From that viewpoint, the BOM is a problem rather than a solution. I was thinking about Notepad: What else could Notepad have done, other than be turned into another program, i.e. delaying the entire UTF-8 support? The closest to Notepad on OS X is probably TextEdit. On my OS X 10.5 computer, TextEdit does not sniff UTF-8 unless there is a BOM. TextEdit defaults to saving as UTF-8 (at least when the situation calls for it), but it does so without including the BOM. Which means that TextEdit fails to re-open the file as UTF-8. 
On my OS X 10.7 computer, TextEdit does sniff UTF-8 (without the BOM). Someone mentioned 'the cost of doing business'. And you have pointed out that it takes time to realize ... That Notepad could have done something else seems to me to be quite hypothetical. PS: I have tried to argue (in a bug report) that Webkit should default to UTF-8, including using UTF-8 detection. But I was shot down with the words that Webkit should work as all other browsers. So it seems one needs a 'notepad' - such as Chrome - to lead the way. I think that a browser fully dedicated to HTML4 but not intending to implement HTML5 will eventually die out. If it exists today, it would indeed be reasonable to accept the BOM. But that's not because reading the spec(s) leads to that as the only conclusion, it's because there's content out there that starts with a BOM. It seems we agree that in 2012, 'pre-HTML5 browsers' cannot be an argument that should cause a warning in the W3 HTML validator or in W3 documents. -- Leif Halvard Silli
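[Editorial aside: the heuristic detection Martin describes relies on UTF-8's strict structure: lead bytes and continuation bytes occupy disjoint ranges, so non-ASCII text in a legacy encoding almost never survives a strict UTF-8 decode. A minimal sketch in Python, assuming a whole-file scan as Martin suggests:]

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic UTF-8 check: try a strict decode of the whole buffer.

    Valid UTF-8 is self-describing: lead bytes (0xC2-0xF4) and continuation
    bytes (0x80-0xBF) occupy disjoint ranges, so legacy-encoded non-ASCII
    text almost never decodes cleanly. No BOM is required.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("blåbærsyltetøy".encode("utf-8")))   # True
print(looks_like_utf8("blåbærsyltetøy".encode("latin-1")))  # False
```

Note the asymmetry Leif and Martin both touch on: a pure-ASCII file passes this check too, which is exactly why an all-escaped page gives the detector nothing to confirm.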
Re: pre-HTML5 and the BOM
Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900: On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, is only about authoring convenience. […] Nevertheless: I, as Web author, would perhaps skip that convenience if I knew that doing so could improve e.g. HTML5 browser's ability to sniff the encoding correctly […] I'm not sure there are many people for whom using named character entities or numeric character references is a convenience. But for those for whom it is a convenience, let them use it. By all means: Let them. But the W3C's I18N working group still gives out advice about when to (not) use escapes.[1] Advice which the homepage of W3.org breaks, since every non-ASCII character of http://www.w3.org is escaped. What the I18N group says in that document is a bit moralistic (along the lines of 'please think about how difficult it is for non-English authors to read escapes for all their characters'). It seems to me that a mention of real effects on browser behavior could be a better form of advice. Especially when coupled with advice about avoiding the BOM.[2] [1] http://www.w3.org/International/techniques/authoring-html#escapes [2] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow -- Leif Halvard Silli
Re: pre-HTML5 and the BOM
Hello Leif, I think that more and more, we are on the wrong mailing list. Regards, Martin. On 2012/07/18 18:47, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900: On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, is only about authoring convenience. […] Nevertheless: I, as Web author, would perhaps skip that convenience if I knew that doing so could improve e.g. HTML5 browser's ability to sniff the encoding correctly […] I'm not sure there are many people for whom using named character entities or numeric character references is a convenience. But for those for whom it is a convenience, let them use it. By all means: Let them. But the W3C's I18N working group still gives out advice about when to (not) use escapes.[1] Advice which the homepage of W3.org breaks - since every non-ASCII character of http://www.w3.org is escaped. What the I18N group says in that document, is a bit moralistic (along the lines 'please think about how difficult it is for non-English authors to read escapes for all their characters). It seems to me that a mention of real effects on browser behavior could be a better form of advice. Especially when coupled with advice about avoiding the BOM.[2] [1] http://www.w3.org/International/techniques/authoring-html#escapes [2] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow
Re: pre-HTML5 and the BOM
Except that the internet is almost unusable without cookies and scripting, lynx(1) works very well, too, if the ncursesw library is linked against (and the terminal font supports Unicode characters). Funny that it writes garbage for |<html><body><p>ä.ü.ö.</p></body></html> but uses UTF-8 by default for |<html><body><p>ä.ü.ö.</p></body></html> Hypertext offers a lot of possibilities to declare the charset, and until then an agnostic 8-bit parser will do fine except for multioctet charsets. Steven
Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)
Original Message Date: Wed, 18 Jul 2012 13:45:59 +0200 From: Steven Atreju snatr...@googlemail.com To: Doug Ewell d...@ewellic.org Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML) Doug Ewell wrote: |For those who haven't yet had enough of this debate yet, here's a link |to an informative blog (with some informative comments) from Michael |Kaplan: | |Every character has a story #4: U+feff (alternate title: UTF-8 is the |BOM, dude!) |http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx | |What should be interesting is that this blog dates to January 2005, |seven and a half years ago, and yet includes the following: | |But every 4-6 months another huge thread on the Unicode List gets |started about how bad the BOM is for UTF-8 and how it breaks UNIX tools |that have been around and able to support UTF-8 without change for |decades and about how Microsoft is evil for shipping Notepad that causes |all of these problems and how neither the W3C nor Unicode would have |ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, |and so on, and so on. | |And here we are again. Interesting, thanks for the pointer. I didn't know that. Funny that a program that cannot handle files larger than 0x7FFF bytes (last time I've used it, '95B) has such a large impact. And sorry for the noise, then. Steven
Re: pre-HTML5 and the BOM
Steven Atreju, Wed, 18 Jul 2012 13:40:30 +0200: Except that the internet is almost unusable without cookies and scripting, lynx(1) works very well, too, if the ncursesw library is linked against (and the terminal font supports Unicode characters). Funny that it writes garbage for |<html><body><p>ä.ü.ö.</p></body></html> but uses UTF-8 by default for |<html><body><p>ä.ü.ö.</p></body></html> Wow, a command line tool that breaks with all you have said about Unix tools, no? :-) It would be perfectly in line with HTML5 if Lynx, with or without linking against ncurses, sniffed the first, BOM-less instance correctly too. However, so far, Chrome seems like the only browser to do so by default. Hypertext offers a lot of possibilities to declare the charset, and until then an agnostic 8-bit parser will do fine except for multioctet charsets. One should perhaps not care about bugs ... But for Lynx, in the version I checked last (probably not linked to ncurses), it did not understand HTML5's new <meta charset=FOO> any better than it understood the BOM. It only understood <meta http-equiv=Content-Type content=FOO>. So, since dropping the new meta element is not really an option, always also sending the charset in the HTTP header on the server is the absolutely safest thing ... -- Leif H Silli
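[Editorial aside: for illustration, a rough sketch of the kind of prescan Leif is talking about, recognizing both the new `<meta charset=...>` form and the legacy `http-equiv` form that Lynx did understand. This is a simplified stand-in, not the HTML5 prescan algorithm itself:]

```python
import re

def sniff_meta_charset(head):
    """Very rough stand-in for an encoding prescan (NOT the HTML5 algorithm).

    Looks at the first 1024 bytes for either declaration form:
      <meta charset=utf-8>                          (HTML5 short form)
      <meta http-equiv="Content-Type"
            content="text/html; charset=utf-8">     (legacy form)
    Returns the charset name in lowercase, or None if nothing is found.
    """
    text = head[:1024].decode("ascii", errors="replace").lower()
    m = re.search(r'<meta\s+charset=["\']?([\w-]+)', text)
    if m:
        return m.group(1)
    m = re.search(r'charset=([\w-]+)', text)
    if m:
        return m.group(1)
    return None

print(sniff_meta_charset(b'<meta charset="UTF-8">'))  # utf-8
print(sniff_meta_charset(
    b'<meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1">'))     # iso-8859-1
```

A browser that only implements the second pattern, as older Lynx versions effectively did, misses nothing here because the short form also contains `charset=`; real pre-HTML5 parsers, however, looked for it only inside a Content-Type `http-equiv`, which is why the HTTP header remains the safest channel.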
RE: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)
Steven Atreju snatreju at googlemail dot com wrote: Funny that a program that cannot handle files larger than 0x7FFF bytes (laste time i've used it, 95B) has such a large impact. Notepad hasn't had this limitation since Windows Me. That was many, many years ago. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: pre-HTML5 and the BOM
On Jul 18, 2012, at 4:21 AM, Leif Halvard Silli wrote: On my OS X 10.7 computer, then TextEdit does sniff UTF-8 (without the BOM). It does indeed have a sniffing feature, though it also appears to use the com.apple.TextEncoding extended attribute, when available (and which it, itself, will create, when saving). -- John W Kennedy If Bill Gates believes in intelligent design, why can't he apply it to Windows?
Re: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)
And those early versions of Notepad for 16/32-bit Windows were not even Unicode compliant (the support for Unicode was minimalist; in fact Unicode was only partly supported, on top of the old ANSI/OEM APIs, without support for the filesystem, and with lots of quirks at the kernel level caused by conversions through a compatibility-thunks layer only; the core memory manager was still just 16-bit, as were the low-level OS and BIOS interfaces; the 32-bit mode was just a DOS extender, and there were lots of limitations compared to NT or OS/2 at the time). The renderer was not capable of supporting complex scripts, and supported just a small subset of TrueType and no OpenType. But anyway this is not so old (Windows 98 and ME were still being sold, shipped and supported some years after 2000). This was just 10 years ago. And it took a lot of time to convince people (and developers) to adopt XP (notably gamers, because most games were using their own DOS extenders which could not work when the Windows GUI was running; you had to return to DOS mode, as there were conflicts between memory managers and the EMM extenders, and the graphics drivers for Windows were not usable for games, or too limited or too slow). 2012/7/18 Doug Ewell d...@ewellic.org: Steven Atreju snatreju at googlemail dot com wrote: Funny that a program that cannot handle files larger than 0x7FFF bytes (last time I've used it, '95B) has such a large impact. Notepad hasn't had this limitation since Windows Me. That was many, many years ago. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell