Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Henri Sivonen
On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli l...@russisk.no wrote:
 It's unclear to me if you are talking about HTTP-level charset=UNICODE
 or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
 BOMless?

 Charset=UNICODE in meta, as generated by MS tools (Office or IE, e.g.),
 usually seems to come with a BOM. But there are still enough occurrences
 of pages without a BOM. I have found UTF-8 pages with the charset=unicode
 label in meta, but the few pages I found contained either a BOM or an
 HTTP-level charset=utf-8. I have too little research material when it
 comes to UTF-8 pages with charset=unicode inside.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useless for
pages that have a BOM, because the BOM is already inspected before
meta and if HTTP-level charset is unrecognized, the BOM wins.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
UTF-8-encoded pages that say charset=unicode in meta if alias
resolution happens before UTF-16 labels are mapped to UTF-8.
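
In other words, roughly (a Python sketch of that ordering; the alias
table and the function name are made up for illustration, not taken
from any spec):

ALIASES = {
    "unicode": "utf-16le",   # the proposed alias under discussion
    "utf-16": "utf-16le",
    "utf-16le": "utf-16le",
    "utf-16be": "utf-16be",
    "utf-8": "utf-8",
}

def encoding_from_meta(label):
    encoding = ALIASES.get(label.strip().lower())
    if encoding is None:
        return None
    # Only after alias resolution: a UTF-16 flavor declared in meta
    # cannot be right, because the prescan only sees ASCII-compatible
    # bytes, so it gets mapped to UTF-8.
    if encoding in ("utf-16le", "utf-16be"):
        return "utf-8"
    return encoding

print(encoding_from_meta("UNICODE"))  # -> "utf-8"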

Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
pages that are (BOMless) UTF-16LE and that have charset=unicode in
meta, because the meta prescan doesn't see UTF-16-encoded metas.
Furthermore, it doesn't make sense to make the meta prescan look for
UTF-16-encoded metas, because it would make sense to honor the value
only if it matched a flavor of UTF-16 appropriate for the pattern of
zero bytes in the file, so it would be more reliable and straightforward
to just analyze the pattern of zero bytes without bothering to look for
UTF-16-encoded metas.
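
For illustration, a minimal sketch of that kind of zero-byte analysis
(the thresholds and the function name are mine, not taken from any
implementation):

def sniff_bomless_utf16(buf):
    if len(buf) < 2:
        return None
    zeros_even = sum(1 for i in range(0, len(buf), 2) if buf[i] == 0)
    zeros_odd = sum(1 for i in range(1, len(buf), 2) if buf[i] == 0)
    pairs = len(buf) // 2
    # Basic-Latin-heavy UTF-16LE has the zero byte second in each pair;
    # UTF-16BE has it first.
    if zeros_odd > 0.9 * pairs and zeros_even == 0:
        return "utf-16le"
    if zeros_even > 0.9 * pairs and zeros_odd == 0:
        return "utf-16be"
    return None

print(sniff_bomless_utf16("hello".encode("utf-16le")))  # -> "utf-16le"
print(sniff_bomless_utf16("hello".encode("utf-16be")))  # -> "utf-16be"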

 When the detector says UTF-8 - that is step 7 of the sniffing algorithm,
 no?
 http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Yes.

  2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
 detector says non-UTF-8.
...
 I think you are mistaken there: If parsers perform UTF-8 detection,
 then unlabelled pages will be detected, and no reparsing will happen.
 Reparsing will not even increase. You at least need to explain this
 negative-spiral theory better before I buy it ... Step 7 will *not*
 lead to reparsing unless the default encoding is WINDOWS-1252. If the
 default encoding is UTF-8, then when step 7 detects UTF-8, parsing
 can continue uninterrupted.

That would be what I labeled as option #2 above.

 What we will instead see is that those using legacy encodings must be
 more clever in labelling their pages, or else they won't be detected.

Many pages that use legacy encodings are legacy pages that aren't
actively maintained. Unmaintained pages aren't going to become more
clever about labeling.

 I am a bit baffled here: It sounds like you are saying that there will
 be bad consequences if browsers become more reliable ...

Becoming more reliable can be bad if the reliability comes at the cost
of performance, which would be the case if the kind of heuristic
detector that e.g. Firefox has was turned on for all locales. (I don't
mean the performance impact of running a detector state machine. I
mean the performance impact of reloading the page or, alternatively,
the loss of incremental rendering.)

A solution that would border on reasonable would be decoding as
US-ASCII up to the first non-ASCII byte and then deciding between
UTF-8 and the locale-specific legacy encoding by examining the first
non-ASCII byte and up to 3 bytes after it to see if they form a valid
UTF-8 byte sequence. But trying to gain more statistical confidence
about UTF-8ness than that would be bad for performance (either due to
stalling stream processing or due to reloading).
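
A rough Python sketch of that check (the function name, the
windows-1252 fallback parameter, and the simplifications noted in the
comments are mine):

def guess_utf8_or_legacy(buf, legacy="windows-1252"):
    i = 0
    while i < len(buf) and buf[i] < 0x80:
        i += 1                   # the ASCII prefix decodes the same either way
    if i == len(buf):
        return legacy            # all ASCII so far: both decodings agree anyway
    lead = buf[i]
    if 0xC2 <= lead <= 0xDF:
        need = 1
    elif 0xE0 <= lead <= 0xEF:
        need = 2
    elif 0xF0 <= lead <= 0xF4:
        need = 3
    else:
        return legacy            # cannot be a UTF-8 lead byte
    tail = buf[i + 1:i + 1 + need]
    # Simplified: overlong/surrogate corner cases are not rejected here.
    if len(tail) == need and all(0x80 <= b <= 0xBF for b in tail):
        return "utf-8"
    return legacy

print(guess_utf8_or_legacy("café".encode("utf-8")))         # -> "utf-8"
print(guess_utf8_or_legacy("café".encode("windows-1252")))  # -> "windows-1252"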

 Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
 detection. So it might still be a competitive advantage.

It would be interesting to know what exactly Chrome does. Maybe
someone who knows the code could enlighten us?

 * Let's say that I *kept* ISO-8859-1 as default encoding, but instead
 enabled the Universal detector. The frame then works.
 * But if I make the frame page very short, 10 * the letter ø as
 content, then the Universal detector fails - in a test on my own
 computer, it guesses the page to be Cyrillic rather than Norwegian.
 * What's the problem? The Universal detector is too greedy - it tries
 to fix more problems than I have. I only want it to guess on UTF-8.
 And if it doesn't detect UTF-8, then it should fall back to the locale
 default (including fall back to the encoding of the parent frame).

 Wouldn't that be an idea?

 No. The current configuration works for Norwegian users already. For
 users from different silos, the ad might break, but ad breakage is
 less bad than spreading heuristic detection to more locales.

 Here I must disagree: Less bad for whom?

For users performance-wise.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Henri Sivonen
On Tue, Jan 3, 2012 at 10:33 AM, Henri Sivonen hsivo...@iki.fi wrote:
 A solution that would border on reasonable would be decoding as
 US-ASCII up to the first non-ASCII byte and then deciding between
 UTF-8 and the locale-specific legacy encoding by examining the first
 non-ASCII byte and up to 3 bytes after it to see if they form a valid
 UTF-8 byte sequence. But trying to gain more statistical confidence
 about UTF-8ness than that would be bad for performance (either due to
 stalling stream processing or due to reloading).

And it's worth noting that the above paragraph states a solution to
the problem that is: How to make it possible to use UTF-8 without
declaring it?

Adding autodetection wouldn't actually force authors to use UTF-8, so
the problem Faruk stated at the start of the thread (authors not using
UTF-8 throughout systems that process user input) wouldn't be solved.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


[whatwg] A few questions on HTML5

2012-01-03 Thread Mani
Hi,

I had a few quick questions on HTML5 (I have been looking at it for
about a month now, and I am fascinated by the possibilities).

1. I believe HTML5 will use no DTD, is that right? So there will be
rules for HTML5 processors (browsers) that they should follow? Can you
also comment as to why? (I believe the reason is to support HTML that is
not well-formed and still not contradict other standards.)

2. Will XHTML5 have a DTD, because XHTML5 must be well-formed?

I will have a couple of more questions based on these responses.

thanks and best, murali.


Re: [whatwg] A few questions on HTML5

2012-01-03 Thread Bronislav Klučka

Hi

On 3.1.2012 10:32, Mani wrote:

I had a few quick questions on HTML5 (I have been looking at it for
about a month now, and I am fascinated by the possibilities).
I do not want to be rude, but such questions do not belong here; this
is not a general or Q&A forum, this is a technical discussion group for
spec designers.

see
http://www.whatwg.org/mailing-list
http://forums.whatwg.org/bb3/index.php

1. I believe HTML5 will use no DTD, is that right? So there will be
rules for HTML5 processors (browsers) that they should follow? Can you
also comment as to why? (I believe the reason is to support HTML that is
not well-formed and still not contradict other standards.)

http://dev.w3.org/html5/spec/syntax.html#syntax
http://dev.w3.org/html5/spec/parsing.html#parsing

HTML5 is a self-contained language, not based on SGML or XML. The reason
is historical baggage: constructs that do not conform to those
specifications yet are de facto industry standards. The WHATWG recognized
that some structures may be invalid according to SGML/XML but can still
be useful, or even the most obvious, solutions. So the WHATWG simply took
the reality that has existed on the web for two decades and created (is
creating) a language that is completely specified and backward compatible.
To put it simply, HTML has become too large and important to be bound by
such constraining rules.



2. Will XHTML5 have a DTD, because XHTML5 must be well-formed?

http://www.w3.org/TR/html5/the-xhtml-syntax.html#writing-xhtml-documents
http://wiki.whatwg.org/wiki/HTML_vs._XHTML


I will have a couple of more questions based on these responses.

You can find a lot of answers by reading the specification and other resources:

Spec: http://dev.w3.org/html5/spec/spec.html
Differences from HTML4: http://dev.w3.org/html5/html4-differences/
Guide for developers: http://dev.w3.org/html5/html-author/
HTML5 Design principles: http://dev.w3.org/html5/html-design-principles/


Bronislav Klucka


Re: [whatwg] A few questions on HTML5

2012-01-03 Thread Jukka K. Korpela

2012-01-03 12:45, Bronislav Klučka wrote:


On 3.1.2012 10:32, Mani wrote:

[…]

2. Will XHTML5 have a DTD, because XHTML5 must be well-formed?

http://www.w3.org/TR/html5/the-xhtml-syntax.html#writing-xhtml-documents
http://wiki.whatwg.org/wiki/HTML_vs._XHTML


To find an answer to the question that was asked, one needs to read 
quite a lot between the lines in the cited documents. The answer appears 
to be “No, XHTML5 won’t have a DTD, since you’re supposed to use a 
validator specifically written for HTML5.”


Allowing XHTML 1.0 and XHTML 1.1 DOCTYPEs as “obsolete but conforming” 
and not saying a word about any DTD that covers any of the HTML5 
novelties looks like a clear indication of intent.


The well-formedness requirement does not imply the need for a DTD at all.
Au contraire, “well-formed” is just a confusing term for conformance to
generic XML rules (“well-formed XML” really means nothing but “XML”), as
opposed to any rules in a DTD, for example.


This appears to mean that when XHTML5 is used together with other XML 
tag sets, you cannot use a DTD-based validator just by adding the 
declarations for the other tags into an XHTML5 DTD. So the question 
really is: will someone want to validate, say, XHTML5 + MathML documents?


Yucca


Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Leif Halvard Silli
Henri Sivonen, Tue Jan 3 00:33:02 PST 2012:
 On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli wrote:

 Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
 UTF-8-encoded pages that say charset=unicode in meta if alias
 resolution happens before UTF-16 labels are mapped to UTF-8.

Yup.
 
 Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
 pages that are (BOMless) UTF-16LE and that have charset=unicode in
 meta, because the meta prescan doesn't see UTF-16-encoded metas.

Hm. Yes. I see that I misread something and ended up believing that 
the meta would *still* be used if the mapping from 'UTF-16' to 
'UTF-8' turned out to be incorrect. I guess I had not understood well 
enough that the meta prescan *really* doesn't see UTF-16-encoded 
metas. Also contributing was the fact that I did not realize that IE 
doesn't actually read the page as UTF-16 but as Windows-1252: 
http://www.hughesrenier.be/actualites.html. (Actually, browsers do 
see the UTF-16 meta, but only if the default encoding is set to 
UTF-16 - see step 1 of '8.2.2.4 Changing the encoding while parsing', 
http://dev.w3.org/html5/spec/parsing.html#change-the-encoding.)

 Furthermore, it doesn't make sense to make the meta prescan look for
 UTF-16-encoded metas, because it would make sense to honor the value
 only if it matched a flavor of UTF-16 appropriate for the pattern of
 zero bytes in the file, so it would be more reliable and straightforward
 to just analyze the pattern of zero bytes without bothering to
 look for UTF-16-encoded metas.

Makes sense.

   [ snip ]
 What we will instead see is that those using legacy encodings must be
 more clever in labelling their pages, or else they won't be detected.
 
 Many pages that use legacy encodings are legacy pages that aren't
 actively maintained. Unmaintained pages aren't going to become more
 clever about labeling.

But their non-UTF-8-ness should be picked up in the first 1024 bytes?

  [... sniff - sorry, meant snip ;-) ...]

 I mean the performance impact of reloading the page or, 
 alternatively, the loss of incremental rendering.)

 A solution that would border on reasonable would be decoding as
 US-ASCII up to the first non-ASCII byte

Thus possibly a prescan of more than 1024 bytes? Is it faster to scan 
ASCII? (In Chrome, there does not seem to be an end to the prescan, as 
long as the source code is ASCII-only.)

 and then deciding between
 UTF-8 and the locale-specific legacy encoding by examining the first
 non-ASCII byte and up to 3 bytes after it to see if they form a valid
 UTF-8 byte sequence.

Except for the specifics, that sounds like more or less the idea I 
tried to state. Maybe it could be filed as a bug in Mozilla? (I could 
do it, but ...)

However, there is one thing that should be added: The parser should 
default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII. Is 
that part of your idea? Because if it does not behave like that, then 
it would work the way Google Chrome currently works, which for the 
following UTF-8-encoded (but charset-unlabelled) page means that it 
defaults to UTF-8:

<!DOCTYPE html><title>æøå</title></html>

While for this - identical - page, it would default to the locale 
encoding, due to the use of ASCII-based character entities, which 
means that it does not detect any UTF-8-ish bytes:

<!DOCTYPE html><title>&#xe6;&#xf8;&#xe5;</title></html>

A weird variant of the latter example is UTF-8-based data URIs, where 
all browsers that I could test (IE only supports data URIs in the 
@src attribute, including script@src) default to the locale encoding 
(apart from Mozilla Camino, which has character detection enabled by 
default):

data:text/html,<!DOCTYPE html><title>%C3%A6%C3%B8%C3%A5</title></html>

All three examples above should default to UTF-8 if the "border on 
sane" approach were applied.
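
For illustration, here is a small Python check of which of the three
payloads even contains a non-ASCII byte for a UTF-8-sequence detector
to look at (the percent-decoding mimics what a browser does with the
data URI body before parsing; the variable names are mine):

from urllib.parse import unquote_to_bytes

pages = {
    "raw UTF-8":     "<!DOCTYPE html><title>æøå</title></html>".encode("utf-8"),
    "entities only": b"<!DOCTYPE html><title>&#xe6;&#xf8;&#xe5;</title></html>",
    "data URI body": unquote_to_bytes(
        "<!DOCTYPE html><title>%C3%A6%C3%B8%C3%A5</title></html>"),
}
for name, buf in pages.items():
    print(name, "contains non-ASCII bytes:", any(b >= 0x80 for b in buf))
# raw UTF-8     -> True  (a UTF-8 sequence check can fire)
# entities only -> False (nothing to detect; only the fallback default helps)
# data URI body -> True  (yet browsers reportedly still use the locale default)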

 But trying to gain more statistical confidence
 about UTF-8ness than that would be bad for performance (either due to
 stalling stream processing or due to reloading).

So here you say that it is better to start presenting early, and 
possibly reload [I think] if during the presentation the encoding 
choice turns out to be wrong, than it would be to investigate too 
much and be absolutely certain before starting to present the page.

Later, at Jan 3 00:50:26 PST 2012, you added:
 And it's worth noting that the above paragraph states a solution to
 the problem that is: How to make it possible to use UTF-8 without
 declaring it?

Indeed.

 Adding autodetection wouldn't actually force authors to use UTF-8, so
 the problem Faruk stated at the start of the thread (authors not using
 UTF-8 throughout systems that process user input) wouldn't be solved.

If we take that logic to its end, then it would not make sense for the 
validator to display an error when a page that contains a form is not 
UTF-8 encoded, either. Because, after all, the backend/whatever could 
be non-UTF-8 based. The only way to solve that 
Re: [whatwg] [encoding] utf-16

2012-01-03 Thread Leif Halvard Silli
Henri Sivonen, Mon Jan 2 07:43:07 PST 2012
 On Fri, Dec 30, 2011 at 12:54 PM, Anne van Kesteren wrote:
 And why should there be UTF-16 sniffing?
 
 The reason why Gecko detects BOMless Basic Latin-only UTF-16
 regardless of the heuristic detector mode is
 https://bugzilla.mozilla.org/show_bug.cgi?id=631751

That bug was not solved perfectly. E.g. this page renders readably in 
IE, but not in Firefox: http://www.hughesrenier.be/actualites.html. 
(For some reason, it renders well if I download it to my hard disk.)
 
 It's quite possible that Firefox could have gotten away with not
 having this behavior.
-- 
Leif Halvard Silli


Re: [whatwg] [encoding] utf-16

2012-01-03 Thread Leif Halvard Silli
Leif Halvard Silli, Tue, 3 Jan 2012 23:51:52 +0100:
 Henri Sivonen, Mon Jan 2 07:43:07 PST 2012
 On Fri, Dec 30, 2011 at 12:54 PM, Anne van Kesteren wrote:
 And why should there be UTF-16 sniffing?
 
 The reason why Gecko detects BOMless Basic Latin-only UTF-16
 regardless of the heuristic detector mode is
 https://bugzilla.mozilla.org/show_bug.cgi?id=631751
 
 That bug was not solved perfectly. E.g. this page renders readably in 
 IE, but not in Firefox: http://www.hughesrenier.be/actualites.html. 
 (For some reason, it renders well if I download it to my hard disk.)

Oops, that was of course because the HTTP level said ISO-8859-1.

 It's quite possible that Firefox could have gotten away with not
 having this behavior.
-- 
Leif H Silli


[whatwg] the impact of select.value behavior clearing current selection prior to setting the new selection

2012-01-03 Thread Jon Lee
Hello,

A long while ago[1] there was a clarification[2] made to the HTML5 spec about 
how setting selectedIndex and value clears out the selectedness of all options 
prior to setting the selection. As of this writing, Safari, Chrome, Opera, and 
Firefox leave the selection alone if the code sets value to null or to a string 
that does not match any of the existing options. I was wondering if anyone had 
any insight into the impact of this new behavior on existing websites.

Thanks,
Jon

[1] http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-October/016842.html
[2] http://html5.org/tools/web-apps-tracker?from=2292&to=2293