Re: pre-HTML5 and the BOM

2012-07-18 Thread Leif Halvard Silli
Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900:
 On 2012/07/17 23:11, Leif Halvard Silli wrote:
 Martin J. Dürst, Tue, 17 Jul 2012 18:49:47 +0900:
 On 2012/07/17 17:22, Leif Halvard Silli wrote:

 that a page with strictly ASCII characters inside could still
 contain character entities/references for characters outside ASCII.
 
 Of course they can. … snip …
 
 And the question was whether such a page should, by default, be seen 
 as UTF-8 encoded.
 
 If I understand correctly, whether it's seen as UTF-8 encoded would 
 be irrelevant when displaying the page, but might be relevant e.g. 
 for form submission and the like?

Yes. There might be technical problems too: HTML5 browsers are, when 
sniffing, asked to scan only a small initial portion of the document. It 
might be a thin reason to default to UTF-8 just because the start of the 
document contains no non-ASCII bytes.
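A rough sketch of what that limited scan means (Python; the 1024-byte 
window matches HTML5's meta prescan limit, and the function name is 
mine):

    PRESCAN_LIMIT = 1024  # HTML5's prescan examines about this much

    def utf8_evidence_in_prefix(body: bytes, limit: int = PRESCAN_LIMIT) -> bool:
        """True if the scanned prefix contains any non-ASCII byte.

        A page that escapes every non-ASCII character never trips this
        test, so a prefix-only sniffer gets no byte-level evidence that
        the page is UTF-8."""
        return any(b >= 0x80 for b in body[:limit])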

 … one browser where it does hurt more directly: … W3M …
 renders &aring; and &#229; as 'aa' instead of as 'å' …
 
 In a followup mail, you write:
 
 To quote one W3m slogan: 'Its 8-bit support is second to none'. W3m is
 a quite modern text browser. It is regularly updated, it can be used
 with emacs, and is the text browser I would recommend.
 
 If W3M is updated so regularly, why isn't the &aring;/&#229; → 'aa' 
 bug simply fixed?

Fair point. I've made the W3m mailing list aware of it.

 So it seems to me that it is always advantageous to type characters
 directly as doing so allows for better character encoding detection in
 case the encoding labels disappear (read: easier to pick up that the
 page is UTF-8 encoded) and also works better in at least one browser.
 It does, as well, make authors more aware of the entire encoding issue
 since it means that the page has to be properly labeled in order to
 work across parsers.
 
 I agree that in general, characters should be encoded directly. There 
 may be exceptions such as &nbsp;, where in some editing environments, 
 it's very helpful to see them explicitly.
 
 But a bug in a minor (or even a major) browser shouldn't be the 
 reason for avoiding character entities and numeric character 
 references.

Advising how to code based on accidental bugs is, of course, 
hopeless.

 The best reason is simply that nobody should be using 
 crutches as long as they can walk with their own legs.

Crutches, in that sense, are only about authoring convenience. And, of 
course, there is a difference between using named and numeric character 
references for a single non-ASCII letter and using them for all of 
them. Nevertheless: I, as a Web author, would perhaps skip that 
convenience if I knew that doing so could improve e.g. an HTML5 
browser's ability to sniff the encoding correctly when all other 
encoding info is lost. If such sniffing can be an alternative to the 
BOM, and the BOM is questionable, then why not mention it as a reason 
to avoid the crutches?
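
To make that concrete (a sketch of my own, not from the thread): the 
two spellings of 'å' differ exactly in the bytes a sniffer gets to see.

    # The escaped spellings are pure ASCII: nothing hints at UTF-8.
    escaped = "&aring; and &#229;".encode("utf-8")
    assert all(b < 0x80 for b in escaped)

    # Typed directly, 'å' becomes the bytes 0xC3 0xA5 -- a well-formed
    # UTF-8 sequence that heuristic detection can pick up even when all
    # encoding labels are lost.
    direct = "å".encode("utf-8")
    assert direct == b"\xc3\xa5"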
-- 
Leif Halvard Silli




Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Martin J. Dürst

Hello Doug,

On 2012/07/18 0:35, Doug Ewell wrote:

For those who haven't had enough of this debate yet, here's a link
to an informative blog (with some informative comments) from Michael
Kaplan:

Every character has a story #4: U+feff (alternate title: UTF-8 is the
BOM, dude!)
http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx

What should be interesting is that this blog dates to January 2005,
seven and a half years ago, and yet includes the following:

But every 4-6 months another huge thread on the Unicode List gets
started


Well, sometimes less and sometimes more than every 4-6 months, but yes.


about how bad the BOM is for UTF-8 and how it breaks UNIX tools
that have been around and able to support UTF-8 without change for
decades


Yes indeed. The BOM and Unix/Linux tools don't work well together.


and about how Microsoft is evil for shipping Notepad that causes
all of these problems


That's a bit overblown, but I guess for a Microsoft employee, it looks 
that way.



and how neither the W3C nor Unicode would have
ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it,


That's true, too. It was indeed Notepad that brought the UTF-8 
BOM/signature to the attention of the W3C and the browser makers.


The problem with the BOM in UTF-8 is that it can be quite helpful (for 
quickly distinguishing between UTF-8 and legacy-encoded files) and quite 
damaging (for programs that use the Unix/Linux model of text 
processing), and that's why it creates so much controversy.
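
Both halves of that trade-off fit in a few lines (a Python sketch; the 
helper name is mine, and cat-style concatenation stands in for the 
Unix byte-stream model of text):

    BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

    def split_bom(data: bytes):
        """Helpful side: the signature cheaply separates UTF-8 files
        from legacy-encoded ones."""
        if data.startswith(BOM):
            return "utf-8", data[len(BOM):]
        return None, data

    # Damaging side: tools that treat text as a plain byte stream, like
    # cat(1), leave the second file's BOM in the middle of the output,
    # where it becomes a stray zero-width no-break space.
    a = BOM + "first\n".encode("utf-8")
    b = BOM + "second\n".encode("utf-8")
    combined = a + b                   # what `cat a.txt b.txt` produces
    assert BOM in combined[len(BOM):]  # a BOM is now stuck mid-stream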


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-18 Thread Martin J. Dürst

On 2012/07/18 16:35, Leif Halvard Silli wrote:

Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900:



The best reason is simply that nobody should be using
crutches as long as they can walk with their own legs.


Crutches, in that sense, are only about authoring convenience. And, of
course, there is a difference between using named and numeric character
references for a single non-ASCII letter and using them for all of
them. Nevertheless: I, as a Web author, would perhaps skip that
convenience if I knew that doing so could improve e.g. an HTML5 browser's
ability to sniff the encoding correctly when all other encoding info is
lost. If such sniffing can be an alternative to the BOM, and the BOM is
questionable, then why not mention it as a reason to avoid the crutches?


I'm not sure there are many people for whom using named character 
entities or numeric character references is a convenience. But for those 
for whom it is a convenience, let them use it.


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-18 Thread Leif Halvard Silli
Martin,

Martin J. Dürst, Wed, 18 Jul 2012 10:05:40 +0900:
 On 2012/07/18 4:35, Leif Halvard Silli wrote:
 But is the Windows Notepad really to blame?
 
 Pretty much so. There may have been other products from Microsoft 
 that also did it, but with respect to forcing browsers and XML 
 parsers to accept a UTF-8 BOM as a signature, Notepad was definitely 
 the main cause, by far.
 
 OK, it was leading the way.
 But can we think of something that could have worked better, in
 practice? And, no, I don't mean 'better' as in 'not leaking the BOM into
 HTML'. I mean 'better' as in 'spreading UTF-8 to the masses'.
 
 UTF-8 is easy and cheap to detect heuristically. It takes a bit more 
 work to scan the whole file than to just look at the first few bytes, 
 but then I don't think anybody is/was editing 1MB files in Notepad. 
 So the BOM/signature is definitely not the reason that UTF-8 spread 
 on the Web and elsewhere.

(The file length issue is an issue on the Web too.)

 The spread of UTF-8 is due to its strict US-ASCII compatibility. 
 Every US-ASCII character/byte represents the same character, and only 
 that character, in UTF-8. A plain ASCII file is a UTF-8 file. If 
 syntax-significant characters are ASCII, then (close to) nothing may 
 need to change when moving from a legacy encoding to UTF-8. On top of 
 that, character synchronization is very easy because leading bytes 
 and trailing bytes have strictly separate values. From that 
 viewpoint, the BOM is a problem rather than a solution.
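
(To illustrate the point just quoted -- a sketch of mine, not Martin's 
code: strict decoding is the entire heuristic, and the byte-range 
property above is what makes false positives rare.)

    def looks_like_utf8(data: bytes) -> bool:
        """Heuristic UTF-8 detection via strict decoding.

        Because lead bytes (0xC2-0xF4) and trailing bytes (0x80-0xBF)
        occupy strictly separate ranges and must appear in the right
        order, text in a legacy 8-bit encoding almost never happens to
        be well-formed UTF-8."""
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False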

I was thinking about Notepad: what else could Notepad have done, other 
than be turned into another program, thereby delaying the entire UTF-8 
support? The closest thing to Notepad on OS X is probably TextEdit. On 
my OS X 10.5 computer, TextEdit does not sniff UTF-8 unless there is a 
BOM. At the same time, TextEdit defaults to saving as UTF-8 (at least 
when the situation calls for it), but does so without including the 
BOM. Which means that TextEdit fails to re-open its own file as UTF-8.

On my OS X 10.7 computer, TextEdit does sniff UTF-8 (without the 
BOM).

Someone mentioned 'the cost of doing business'. And you have pointed 
out that it takes time to realize ...  That Notepad could have done 
something else seems to me quite hypothetical.

PS: I have tried to argue (in a bug report) that Webkit should default 
to UTF-8, including using UTF-8 detection. But I was shot down with the 
words that Webkit should work like all other browsers. So it seems one 
needs a 'Notepad' - such as Chrome - to lead the way.

 I think that a browser fully dedicated to HTML4 but not intending to 
 implement HTML5 will eventually die out. If such a browser exists 
 today, it would indeed be reasonable for it to accept the BOM. But 
 that's not because 
 reading the spec(s) leads to that as the only conclusion, it's 
 because there's content out there that starts with a BOM.

It seems we agree that in 2012, 'pre-HTML5 browsers' cannot be an 
argument that should cause a warning in the W3C HTML validator or in 
W3C documents.
-- 
Leif Halvard Silli




Re: pre-HTML5 and the BOM

2012-07-18 Thread Leif Halvard Silli
Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900:
 On 2012/07/18 16:35, Leif Halvard Silli wrote:
 Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900:
 
 The best reason is simply that nobody should be using
 crutches as long as they can walk with their own legs.
 
 Crutches, in that sense, are only about authoring convenience. 
 […] Nevertheless: I, as a Web author, would perhaps skip that
 convenience if I knew that doing so could improve e.g. an HTML5 
 browser's ability to sniff the encoding correctly […]
 
 I'm not sure there are many people for whom using named character 
 entities or numeric character references is a convenience. But for 
 those for whom it is a convenience, let them use it.

By all means: Let them.

But the W3C's I18N working group still gives advice about when to 
(not) use escapes.[1] Advice which the homepage of W3.org breaks, 
since every non-ASCII character of http://www.w3.org is escaped.

What the I18N group says in that document is a bit moralistic (along 
the lines of 'please think about how difficult it is for non-English 
authors to read escapes for all their characters'). It seems to me that 
a mention of real effects on browser behavior could be a better form of 
advice. Especially when coupled with advice about avoiding the BOM.[2]

[1] http://www.w3.org/International/techniques/authoring-html#escapes
[2] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow
-- 
Leif Halvard Silli




Re: pre-HTML5 and the BOM

2012-07-18 Thread Martin J. Dürst

Hello Leif,

I think that more and more, we are on the wrong mailing list.

Regards,   Martin.

On 2012/07/18 18:47, Leif Halvard Silli wrote:

Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900:

On 2012/07/18 16:35, Leif Halvard Silli wrote:

Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900:



The best reason is simply that nobody should be using
crutches as long as they can walk with their own legs.


Crutches, in that sense, are only about authoring convenience.
[…] Nevertheless: I, as a Web author, would perhaps skip that
convenience if I knew that doing so could improve e.g. an HTML5
browser's ability to sniff the encoding correctly […]


I'm not sure there are many people for whom using named character
entities or numeric character references is a convenience. But for
those for whom it is a convenience, let them use it.


By all means: Let them.

But the W3C's I18N working group still gives advice about when to
(not) use escapes.[1] Advice which the homepage of W3.org breaks,
since every non-ASCII character of http://www.w3.org is escaped.

What the I18N group says in that document is a bit moralistic (along
the lines of 'please think about how difficult it is for non-English
authors to read escapes for all their characters'). It seems to me that
a mention of real effects on browser behavior could be a better form of
advice. Especially when coupled with advice about avoiding the BOM.[2]

[1] http://www.w3.org/International/techniques/authoring-html#escapes
[2] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow




Re: pre-HTML5 and the BOM

2012-07-18 Thread Steven Atreju
Except that the internet is almost unusable without cookies
and scripting, lynx(1) works very well, too, if it is linked
against the ncursesw library (and the terminal font supports
Unicode characters).  Funny that it writes garbage for

 |<html><body><p>ä.ü.ö.</p></body></html>

but uses UTF-8 by default for the same document when it is prefixed 
with a BOM:

 |<html><body><p>ä.ü.ö.</p></body></html>

Hypertext offers a lot of possibilities to declare the charset,
and until a declaration is found, an agnostic 8-bit parser will
do fine except for multioctet charsets.

  Steven




Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Steven Atreju
 Original Message 
Date: Wed, 18 Jul 2012 13:45:59 +0200
From: Steven Atreju snatr...@googlemail.com
To: Doug Ewell d...@ewellic.org
Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML)

Doug Ewell wrote:

 |For those who haven't had enough of this debate yet, here's a link
 |to an informative blog (with some informative comments) from Michael
 |Kaplan:
 |
 |Every character has a story #4: U+feff (alternate title: UTF-8 is the
 |BOM, dude!)
 |http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx
 |
 |What should be interesting is that this blog dates to January 2005,
 |seven and a half years ago, and yet includes the following:
 |
 |But every 4-6 months another huge thread on the Unicode List gets
 |started about how bad the BOM is for UTF-8 and how it breaks UNIX tools
 |that have been around and able to support UTF-8 without change for
 |decades and about how Microsoft is evil for shipping Notepad that causes
 |all of these problems and how neither the W3C nor Unicode would have
 |ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it,
 |and so on, and so on.
 |
 |And here we are again.

Interesting, thanks for the pointer.  I didn't know that.
Funny that a program that cannot handle files larger than 0x7FFF
bytes (last time I used it, on Windows 95B) has such a large impact.
And sorry for the noise, then.

  Steven



Re: pre-HTML5 and the BOM

2012-07-18 Thread Leif Halvard Silli
Steven Atreju, Wed, 18 Jul 2012 13:40:30 +0200:
 Except that the internet is almost unusable without cookies
 and scripting, lynx(1) works very well, too, if it is linked
 against the ncursesw library (and the terminal font supports
 Unicode characters).  Funny that it writes garbage for
 
  |<html><body><p>ä.ü.ö.</p></body></html>
 
 but uses UTF-8 by default for the same document when it is prefixed
 with a BOM:

  |<html><body><p>ä.ü.ö.</p></body></html>

Wow, a command line tool that breaks with all you have said about Unix 
tools, no? :-)

It would be perfectly in line with HTML5 if Lynx, with or without 
linking against ncurses, sniffed the first, BOM-less instance correctly 
too. However, so far, Chrome seems like the only browser to do so by 
default.

 Hypertext offers a lot of possibilities to declare the charset,
 and until then an agnostic 8-bit parser will do fine except
 for multioctet charsets.

One should perhaps not care about bugs ... But for Lynx, in the version 
I last checked (probably not linked against ncurses), it did not 
understand HTML5's new <meta charset=FOO> any better than it 
understood the BOM. It only understood <meta http-equiv=Content-Type 
content=FOO>. So, since dropping the new meta element is not really 
an option, always also sending the charset in the HTTP header from the 
server is the absolutely safest thing ...
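
For what it is worth, the three sources discussed in this thread can 
be checked mechanically; a sketch (mine, and the checking order here is 
just one plausible order -- real browsers have their own precedence 
rules):

    import re

    def declared_encoding(content_type, body: bytes):
        """Look for a charset in the HTTP header, the BOM, and the two
        meta variants, in that (illustrative) order."""
        # 1. HTTP: Content-Type: text/html; charset=utf-8
        if content_type:
            m = re.search(r"charset=([\w-]+)", content_type, re.I)
            if m:
                return m.group(1)
        # 2. The UTF-8 BOM/signature at the very start of the body
        if body.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        # 3. Prescan of the first 1024 bytes, catching both
        #    <meta charset=...> and the older http-equiv form
        head = body[:1024].decode("ascii", errors="replace")
        m = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', head, re.I)
        if m:
            return m.group(1)
        return None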
-- 
Leif H Silli




RE: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Doug Ewell
Steven Atreju snatreju at googlemail dot com wrote:

 Funny that a program that cannot handle files larger than 0x7FFF
 bytes (last time I used it, on Windows 95B) has such a large impact.

Notepad hasn't had this limitation since Windows Me. That was many, many
years ago.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: pre-HTML5 and the BOM

2012-07-18 Thread John W Kennedy
On Jul 18, 2012, at 4:21 AM, Leif Halvard Silli wrote:
 On my OS X 10.7 computer, TextEdit does sniff UTF-8 (without the 
 BOM).


It does indeed have a sniffing feature, though it also appears to use the 
com.apple.TextEncoding extended attribute, when available (and which it 
will itself create when saving).
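
That attribute can be inspected from the command line with OS X's 
xattr(1); a small sketch (the function name is mine, and the 
'IANA-name;CFStringEncoding-id' value format is an assumption about 
what TextEdit writes):

    import subprocess

    def textedit_encoding(path):
        """Read the com.apple.TextEncoding extended attribute, which
        looks like 'utf-8;134217984' on a TextEdit-saved UTF-8 file."""
        try:
            out = subprocess.check_output(
                ["xattr", "-p", "com.apple.TextEncoding", path],
                stderr=subprocess.DEVNULL,
            )
        except (subprocess.CalledProcessError, FileNotFoundError):
            return None  # attribute absent, or no xattr tool here
        return out.decode("ascii", "replace").split(";")[0].strip()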

-- 
John W Kennedy
If Bill Gates believes in intelligent design, why can't he apply it to 
Windows?






Re: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Philippe Verdy
And those early versions of Notepad for 16/32-bit Windows were not
even Unicode compliant (the support for Unicode was minimalist; in
fact Unicode was only partly supported, on top of the old ANSI/OEM
APIs, without support in the filesystem, and with lots of quirks at
the kernel level caused by conversions through compatibility thunk
layers only; the core memory manager was still just 16-bit, as were
the low-level OS and BIOS interfaces; the 32-bit mode was just a DOS
extender, and there were lots of limitations compared to NT or OS/2
at the time).
The renderer was not capable of supporting complex scripts, and
supported just a small subset of TrueType and no OpenType.

But anyway, this is not so old (Windows 98 and ME were still sold,
shipped and supported for some years after 2000). That was just 10
years ago. And it took a lot of time to convince people (and
developers) to adopt XP (notably gamers, because most games were
using their own DOS extenders, which could not work while the Windows
GUI was running: you had to return to DOS mode, as there were conflicts
between memory managers and the EMM extenders, and the graphics
drivers for Windows were unusable for games, or too limited or too slow).

2012/7/18 Doug Ewell d...@ewellic.org:
 Steven Atreju snatreju at googlemail dot com wrote:

 Funny that a program that cannot handle files larger than 0x7FFF
 bytes (last time I used it, on Windows 95B) has such a large impact.

 Notepad hasn't had this limitation since Windows Me. That was many, many
 years ago.

 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell