Re: [whatwg] Encoding sniffing algorithm

2012-09-09 Thread Leif Halvard Silli
Ian Hickson ian at hixie.ch  on Thu Sep 6 12:55:03 PDT 2012:
 On Fri, 27 Jul 2012, Leif Halvard Silli wrote:

 Revised encoding sniffing algorithm proposal:
 
 NEW! 0. document is in the XML format - opt out of the algorithm.
 [This step is already implicit in the spec, but it would
 make sense to explicitly include it, to make sure that
 one could e.g. write test cases to verify that this step
 is implemented. Currently Safari, Chrome and Opera do 
 not implement this step 100%.]
 
 I don't understand the relevance of the algorithm to XML. Why would anyone 
 even look at this algorithm if they were parsing XML?

In principle it should not be needed. Agree. 

But many of those who are parsing XML are also parsing HTML - for that 
reason it should be natural for them to compare specs and requirements. 
Currently, Webkit and Chromium in particular seem to be colored by 
their HTML parsing when they parse XML. (See the table in my blog 
post.) Also, the spec already includes, in a few places, phrases 
similar to 'if it is XML, then abort these steps' (for example in 
'3.4.1 Opening the input stream'),[*] so there is some precedent, I think.

[*] 
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#opening-the-input-stream
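
For concreteness, a minimal sketch (Python; the names and the exact 
MIME-type test are my own choices, not spec text) of what an explicit 
step 0 could look like:

    def choose_algorithm(content_type):
        # Hypothetical "step 0": XML documents opt out of HTML sniffing.
        mime = content_type.split(";")[0].strip().lower()
        if mime in ("text/xml", "application/xml") or mime.endswith("+xml"):
            return "xml-encoding-rules"       # XML declaration / BOM rules apply
        return "html-sniffing-algorithm"      # continue with the HTML steps

    print(choose_algorithm("application/xhtml+xml; charset=utf-8"))
    # -> xml-encoding-rules
    print(choose_algorithm("text/html"))
    # -> html-sniffing-algorithm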

 NEW! #. Alternative: The BOM signature could go here instead of 
 in step 5. There is a bug to move the BOM here and make
 it override anything else. What speaks against this:
   a) Firefox, IE10 and Opera do not currently have
  this behavior.
   b) this revision of the sniffing algorithm, especially
  the revision in step 6 (required UTF-8 detection),
  might make the BOM-trumps-everything-else override
  less necessary.
 What speaks for this override:
   a) Safari, Chrome and legacy IE implement it.
   b) some legacy content may depend on it.
 
 Not sure what this means.

You will be dealing with it when you take care of Anne's bug: Bug 
15359 "Make BOM trump HTTP".[*] Thus, you can just ignore it here. 
[*] https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359
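
For what it's worth, the BOM-signature check itself is trivial - the 
whole disagreement is about where in the priority order it runs. A 
sketch (my own illustration, not from the bug):

    def sniff_bom(data):
        # Return (encoding, bom_length), or (None, 0) if no BOM signature.
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8", 3
        if data.startswith(b"\xfe\xff"):
            return "utf-16be", 2
        if data.startswith(b"\xff\xfe"):
            return "utf-16le", 2
        return None, 0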


  1. user override.
 (PS: The spec should clarify whether user override is
  cacheable.)
 
 This seems to be entirely a user interface issue.

But then, why do you go on to describe it in the new note? (See below.)


 NEW! 2. iframe inherits user override from parent browsing context
 [Currently not mentioned in the spec, despite that all
  UAs do have this step for HTML docs.]
 
 That's a UI issue much like whether it's remembered or not. But I've added 
 a non-normative note.

Your new note:

1. Typically, user agents remember such user requests 
   across sessions, and in some cases apply them to 
   documents in iframes as well.

My comments:

   1: How does that differ from the info on 'the likely encoding' step?

   2: Could you define 'sessions' somewhere? It sounds to me like the 
'sessions' behavior that you describe resembles the Opera behavior. 
That is bad, since the Opera behavior is the least typical one. (And 
the most annoying from a page developer's point of view.) The typical 
thing - which Opera breaks! - is to limit the encoding override, in 
some way or another, to the current *tab* only. Thus, if you insist on 
describing what UAs typically do, then instead of describing the 
exception (Opera), you should say that browsers *differ*, but that 
the typical thing is to limit the encoding override, in some way or 
another, to the current tab. 

   3: Browsers differ enough for you to evaluate how they behave and 
pick the best behavior. However, I'd say Firefox is best, as it offers a 
compromise between IE and Webkit. (See below.)

Comments in more detail:

FIRSTLY: Regarding 'across sessions': my assumption would be 
that a single session equals the lifespan of a single tab (or a 
single window, if there are no tabs in the window). If so, then that is 
how Safari/Chrome behave: the override lasts as long as one stays in the 
current tab.

SECONDLY: Does 'sessions' relate to a particular document - as in 
'the same document across several sessions'? Or to a particular 
tab/window - as in 'session = tab'?
  * Under FIRSTLY, I described how Safari/Chrome behave: they do not 
give heed to the document. They *only* give heed to the current 
tab/window: if you override a document to use the KOI8-R encoding, then 
the next document you load in the same tab will use the KOI8-R encoding 
too.
  * Internet Explorer (version 8, at least) will, by contrast, give 
heed to that particular document, it seems. Thus it appears not to 
reuse the overridden encoding when it meets a new document, in the 
same tab, whose encoding is not declared. *However*, just as with 
Safari/Chrome, once you open the same document (whose encoding was 
overridden) in a new tab, it doesn't remember the encoding override 
anymore. So the encoding ...

Re: [whatwg] Encoding sniffing algorithm

2012-09-06 Thread Ian Hickson
On Fri, 27 Jul 2012, Leif Halvard Silli wrote:

 I have just written a document on how implementations prioritize 
 encoding info for HTML documents.[1] (As that document shows, I have not 
 tested Safari 6.) Based on my findings there, I would like to suggest 
 that the spec's encoding sniffing algorithm should be updated to look as 
 follows:
 
 Revised encoding sniffing algorithm proposal:
 
 NEW! 0. document is in the XML format - opt out of the algorithm.
 [This step is already implicit in the spec, but it would
 make sense to explicitly include it, to make sure that
 one could e.g. write test cases to verify that this step
 is implemented. Currently Safari, Chrome and Opera do 
 not implement this step 100%.]

I don't understand the relevance of the algorithm to XML. Why would anyone 
even look at this algorithm if they were parsing XML?


 NEW! #. Alternative: The BOM signature could go here instead of 
 in step 5. There is a bug to move the BOM here and make
 it override anything else. What speaks against this:
   a) Firefox, IE10 and Opera do not currently have
  this behavior.
   b) this revision of the sniffing algorithm, especially
  the revision in step 6 (required UTF-8 detection),
  might make the BOM-trumps-everything-else override
  less necessary.
 What speaks for this override:
   a) Safari, Chrome and legacy IE implement it.
   b) some legacy content may depend on it.

Not sure what this means.


  1. user override.
 (PS: The spec should clarify whether user override is
  cacheable.)

This seems to be entirely a user interface issue.


 NEW! 2. iframe inherits user override from parent browsing context
 [Currently not mentioned in the spec, despite that all
  UAs do have this step for HTML docs.]

That's a UI issue much like whether it's remembered or not. But I've added 
a non-normative note.


 NEW! 6. UTF-8 detection.
 I think we should separate UTF-8 detection from other
 detection in order to make this step obligatory.
 What is new here is only the limitation to UTF-8
 detection, plus that the step should be obligatory. 
 (Thus: if the content is not detected as UTF-8, then
 the parser proceeds to the next step in the algorithm.)
 This step would make browsers lean more strongly 
 towards UTF-8.

Without a specific algorithm to detect UTF-8, this is meaningless.
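
One possible concrete reading, as an illustration only (not spec text, 
and the 1024-byte limit is an arbitrary choice): treat the stream as 
UTF-8 only if a prefix of it decodes as well-formed UTF-8 and contains 
at least one multi-byte sequence, so that pure-ASCII content does not 
short-circuit the later steps:

    def looks_like_utf8(data, limit=1024):
        prefix = data[:limit]
        try:
            prefix.decode("utf-8")              # must be well-formed UTF-8
        except UnicodeDecodeError:
            return False
        return any(b >= 0x80 for b in prefix)   # require non-ASCII evidence

A real check would also need to tolerate a multi-byte sequence cut off 
at the limit, and pick the prefix length deliberately.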


 NEW! 7. parent browsing context default.
 The current spec does not mention this step at all,
 despite the fact that Opera, IE, Safari, Chrome and
 Firefox all implement it.

Added. (Some comprehensive testing of this would be good, e.g. comparing 
it to each of the earlier and later steps, considering it with different 
ways of giving the encoding, different locales, etc.)


 Regarding 6. and 7., the order is important. Chrome
 does for instance perform UTF-8 detection, but it does it
 only /after/ the parent browsing context default. Whereas
 everyone else (Opera 12 by default, Firefox for some locales
 - I don't know if there are others) lets it happen before
 the 'parent browsing context default'.

Can you elaborate on this?


 NEW! 8. info on “the likely encoding”
 The main newness is that this step is placed _after_ 
 the (revised) UTF-8 detection and after the (new) parent
 browsing context default.
 The name 'the likely encoding' is from the current spec
 text. I am a bit uncertain about what it means in the 
 current spec, though. So I have moved here what I think
 makes sense. The steps under this point should perhaps be
 optional:
 
 a. detection of other charsets than UTF-8
    (e.g. the optional Cyrillic detection in
    Firefox or legacy Asian encoding detection.
    The actual detection might happen in step 6,
    but it should only be made to count here.)

I don't understand your reasoning on the desired ordering here.


 b. markup label of the sister language:
    <?xml version="1.0" encoding="UTF-8"?>
    (Opera/Webkit/Chrome currently have this directly
    after the native encoding label step - step 5.)

No idea what this means.


 c. Other things? What does 'the likely encoding' currently
    refer to, exactly?

The spec gives an example.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

[whatwg] Encoding sniffing algorithm - update proposal

2012-07-26 Thread Leif Halvard Silli
I have just written a document on how implementations prioritize 
encoding info for HTML documents.[1] (As that document shows, I have 
not tested Safari 6.) Based on my findings there, I would like to 
suggest that the spec's encoding sniffing algorithm should be updated 
to look as follows:

Revised encoding sniffing algorithm proposal (restated compactly in 
code after the list):

NEW! 0. document is in the XML format - opt out of the algorithm.
[This step is already implicit in the spec, but it would
make sense to explicitly include it, to make sure that
one could e.g. write test cases to verify that this step
is implemented. Currently Safari, Chrome and Opera do 
not implement this step 100%.]
 
NEW! #. Alternative: The BOM signature could go here instead of 
in step 5. There is a bug to move the BOM here and make
it override anything else. What speaks against this:
  a) Firefox, IE10 and Opera do not currently have
 this behavior.
  b) this revision of the sniffing algorithm, especially
 the revision in step 6 (required UTF-8 detection),
 might make the BOM-trumps-everything-else override
 less necessary.
What speaks for this override:
  a) Safari, Chrome and legacy IE implement it.
  b) some legacy content may depend on it.

 1. user override.
(PS: The spec should clarify whether user override is
 cacheable.)

NEW! 2. iframe inherits user override from parent browsing context
[Currently not mentioned in the spec, despite that all
 UAs do have this step for HTML docs.]

 3. explicit charset parameter in the Content-Type header.

 4. BOM signature [or as the second step, see above]

 5. native markup label <meta charset="UTF-8">

NEW! 6. UTF-8 detection.
I think we should separate UTF-8 detection from other
detection in order to make this step obligatory.
What is new here is only the limitation to UTF-8
detection, plus that the step should be obligatory. 
(Thus: if the content is not detected as UTF-8, then
the parser proceeds to the next step in the algorithm.)
This step would make browsers lean more strongly 
towards UTF-8.

NEW! 7. parent browsing context default.
The current spec does not mention this step at all,
despite the fact that Opera, IE, Safari, Chrome and
Firefox all implement it.

Regarding 6. and 7., the order is important. Chrome
does for instance perform UTF-8 detection, but it does it
only /after/ the parent browsing context default. Whereas
everyone else (Opera 12 by default, Firefox for some locales
- I don't know if there are others) lets it happen before
the 'parent browsing context default'.

NEW! 8. info on “the likely encoding”
The main newness is that this step is placed _after_ 
the (revised) UTF-8 detection and after the (new) parent
browsing context default.
The name 'the likely encoding' is from the current spec
text. I am a bit uncertain about what it means in the 
current spec, though. So I have moved here what I think
makes sense. The steps under this point should perhaps be
optional:

a. detection of other charsets than UTF-8
   (e.g. the optional Cyrillic detection in
   Firefox or legacy Asian encoding detection.
   The actual detection might happen in step 6,
   but it should only be made to count here.)
b. markup label of the sister language:
   <?xml version="1.0" encoding="UTF-8"?>
   (Opera/Webkit/Chrome currently have this directly
   after the native encoding label step - step 5.)
c. Other things? What does 'the likely encoding' currently
   refer to, exactly?

 9. locale default
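
The proposed priority order, restated compactly (informal labels of my 
own, not spec text; sketch only):

    SNIFFING_ORDER = [
        ("0", "XML document: opt out, use the XML encoding rules"),
        ("#", "alternative slot for the BOM, if it is to trump everything"),
        ("1", "user override"),
        ("2", "iframe inherits user override from parent browsing context"),
        ("3", "charset parameter in the Content-Type header"),
        ("4", "BOM signature (unless moved to the '#' slot)"),
        ("5", "native markup label: <meta charset=...>"),
        ("6", "obligatory UTF-8 detection"),
        ("7", "parent browsing context default"),
        ("8", "'likely encoding' info: non-UTF-8 detection, <?xml?> label"),
        ("9", "locale default"),
    ]

    for step, what in SNIFFING_ORDER:
        print(f"{step:>2}. {what}")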

[1] 
http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm
[2] To the question of whether the BOM should trump everything else: 
I think it would be more important to get the other parts of 
this algorithm right. If we do get the rest of it right, then the 'BOM 
should trump' argument becomes less important.
-- 
Leif Halvard Silli

Re: [whatwg] Encoding Sniffing

2012-04-23 Thread Henri Sivonen
On Sat, Apr 21, 2012 at 1:21 PM, Anne van Kesteren ann...@opera.com wrote:
 This morning I looked into what it would take to define Encoding Sniffing.
 http://wiki.whatwg.org/wiki/Encoding#Sniffing has links as to what I looked
 at (minus Opera internal). As far as I can tell Gecko has the most
 comprehensive approach and should not be too hard to define (though writing
 it all out correctly and clearly will be some work).

The Gecko notes aren't quite right:
 * The detector chosen from the UI is used for HTML and plain text
when loading those in a browsing context from HTTP GET or from a
non-http URL. (Not used for POST responses. Not used for XHR.)
 * The default for the UI setting depends on the locale. Most locales
default to no detector at all. Only zh-TW defaults to the Universal
detector. (I'm not sure why, but I think this is a bug of *some* kind.
Perhaps the localizer wanted to detect both Traditional and Simplified
Chinese encodings and we don't have a detector configuration for
Traditional+Simplified.) Other locales that default to having a
detector enabled default to a locale-specific detector (e.g. Japanese
or Ukrainian).
 * The Universal detector is used regardless of UI setting or locale
when using the FileReader to read a local file as text. (I'm
personally very unhappy about this sort of use of heuristics in a new
feature.)
 * The Universal detector isn't really universal. In particular, it
misdetects Central European encodings like ISO-8859-2. (I'm personally
unhappy that we expose the Universal detector in the UI and thereby
bait people to enable it.)
 * Regardless of detector setting, when loading HTML or plain text in
a browsing context, Basic Latin encoded as UTF-16BE or UTF-16LE is
detected. This detection is not performed by FileReader.
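
A rough reconstruction of that last check (my guess at its shape, not 
Gecko's actual code): BOM-less Basic-Latin UTF-16 shows up as 
alternating NUL and non-NUL ASCII bytes.

    def sniff_bomless_utf16(data, limit=1024):
        prefix = data[:limit]
        if len(prefix) < 4:
            return None
        even, odd = prefix[0::2], prefix[1::2]
        if all(b == 0 for b in even) and all(0 < b < 0x80 for b in odd):
            return "utf-16be"   # 00 xx 00 xx ... (Basic Latin, big-endian)
        if all(b == 0 for b in odd) and all(0 < b < 0x80 for b in even):
            return "utf-16le"   # xx 00 xx 00 ... (Basic Latin, little-endian)
        return None

    print(sniff_bomless_utf16("hello".encode("utf-16le")))   # utf-16le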

 I have some questions though:

 1) Is this something we want to define and eventually implement the same
 way?

I think yes in principle. In practice, it might be hard to get this
done. E.g. in the case of Gecko, we'd need someone who has no higher
priority work than rewriting chardet in compliance with the
hypothetical spec.

I don't want to enable heuristic detection for all HTML page loads.
Yet, it seems that we can't get rid of it for e.g. the Japanese
context. (It's so sad that the situation is the worst in places that
have multiple encodings and, therefore, logically should be more aware
of the need to declare which one is in use. Sigh.) I think it is bad
that the Web-exposed behavior of the browser depends on the UI locale
of the browser. I think it would be a worthwhile research project to
find out whether it would be feasible to trigger language-specific
heuristic detection on a per-TLD basis instead of on a per-UI-locale
basis (e.g. enabling the Japanese detector for all pages loaded from .jp
and the Russian detector for all pages loaded from .ru regardless of UI
locale, and requiring Japanese or Russian .com sites to get their
charset act together - or maybe having a short list of popular special
cases that don't use a country TLD but don't declare the encoding, either).
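
The TLD idea in a nutshell (hypothetical mapping and names; nothing 
like this has shipped):

    from urllib.parse import urlsplit

    TLD_DETECTORS = {"jp": "japanese", "ru": "russian", "ua": "ukrainian"}

    def detector_for_url(url):
        host = urlsplit(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        return TLD_DETECTORS.get(tld)   # None = no heuristic detection

    print(detector_for_url("http://www.example.jp/"))   # japanese
    print(detector_for_url("http://example.com/"))      # None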

 2) Does this need to apply outside HTML? For JavaScript it is forbidden per the
 HTML standard at the moment. CSS and XML do not allow it either. Is it used
 for decoding text/plain at the moment?

Detection is used for text/plain in Gecko when it would be used for text/html.

I think detection shouldn't be used for anything except plain text and
HTML being loaded into a browsing context, considering that we've managed
this far without it (well, except for FileReader).  (Note that, when not
declaring an encoding of their own, JavaScript and CSS inherit the
encoding of the HTML document that references them.)

 3) Is there a limit to how many bytes we should look at?

In Gecko, the Basic Latin encoded as UTF-16BE or UTF-16LE check is run
on the first 1024 bytes.  For the other heuristic detections, there is
no limit, and changing the encoding potentially causes renavigation to
the page.  During the Firefox 4 development cycle, there was a limit
of 1024 bytes (no renavigation!), but it was removed in order to
support the Japanese Planet Debian (site fixed since then) and other
unspecified but rumored Japanese sites.

On Sun, Apr 22, 2012 at 2:11 AM, Silvia Pfeiffer
silviapfeiff...@gmail.com wrote:
 We've had some discussion on the usefulness of this in WebVTT - mostly
 just in relation to HTML, though I am sure that stand-alone video
 players that decode WebVTT would find it useful, too.

WebVTT is a new format with no legacy. Instead of letting it become
infected with heuristic detection, we should go the other direction
and hardwire it as UTF-8 like we did with app cache manifests and
JSON-in-XHR.  No one should be creating new content in encodings other
than UTF-8. Those who can't be bothered to use The Encoding deserve
REPLACEMENT CHARACTERs. Heuristic detection is for unlabeled legacy
content.
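
In code terms, hardwiring UTF-8 amounts to something like this sketch: 
always decode as UTF-8 and map malformed byte sequences to U+FFFD 
instead of guessing another encoding.

    def decode_webvtt(data):
        # Always UTF-8; malformed sequences become U+FFFD REPLACEMENT CHARACTER.
        return data.decode("utf-8", errors="replace")

    print(decode_webvtt(b"WEBVTT\n\n00:00.000 --> 00:01.000\nCaf\xe9"))
    # The stray 0xE9 byte becomes U+FFFD rather than triggering sniffing.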

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Encoding Sniffing

2012-04-23 Thread Julian Reschke

On 2012-04-23 10:19, Henri Sivonen wrote:

...
  * The Universal detector is used regardless of UI setting or locale
when using the FileReader to read a local file as text. (I'm
personally very unhappy about this sort of use of heuristics in a new
feature.)

 ...

+1


...
WebVTT is a new format with no legacy. Instead of letting it become
infected with heuristic detection, we should go the other direction
and hardwire it as UTF-8 like we did with app cache manifests and
JSON-in-XHR.  No one should be creating new content in encodings other
than UTF-8. Those who can't be bothered to use The Encoding deserve
REPLACEMENT CHARACTERs. Heuristic detection is for unlabeled legacy
content.
...


+1




Re: [whatwg] Encoding Sniffing

2012-04-23 Thread Alexey Proskuryakov

On 21.04.2012, at 3:21, Anne van Kesteren wrote:

 1) Is this something we want to define and eventually implement the same way?

I think that the general direction should be getting rid of encoding sniffing. 
It's very rarely helpful, if ever, and implementations are wildly different.

WebKit can optionally use ICU for charset detection. We also have custom 
built-in heuristics to switch between Japanese encodings only (think rendering 
unlabeled EUC-JP pages when the default browser encoding is set to Shift-JIS). 
Safari doesn't enable ICU-based detection, with no visible user discontent, and I 
don't know if the Japanese heuristics are still important.

 2) Does this need to apply outside HTML? For JavaScript it is forbidden per the 
 HTML standard at the moment. CSS and XML do not allow it either. Is it used 
 for decoding text/plain at the moment?
 3) Is there a limit to how many bytes we should look at?

Related to the last question, WebKit doesn't implement re-navigation (neither 
for charset sniffing, nor for meta charset), and I don't think that we ever 
should.

- WBR, Alexey Proskuryakov



[whatwg] Encoding Sniffing

2012-04-21 Thread Anne van Kesteren

Hey,

This morning I looked into what it would take to define Encoding Sniffing.  
http://wiki.whatwg.org/wiki/Encoding#Sniffing has links as to what I  
looked at (minus Opera internal). As far as I can tell Gecko has the most  
comprehensive approach and should not be too hard to define (though  
writing it all out correctly and clearly will be some work).


I have some questions though:

1) Is this something we want to define and eventually implement the same  
way?
2) Does this need to apply outside HTML? For JavaScript it is forbidden per  
the HTML standard at the moment. CSS and XML do not allow it either. Is it  
used for decoding text/plain at the moment?

3) Is there a limit to how many bytes we should look at?

Thanks,


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Encoding Sniffing

2012-04-21 Thread Silvia Pfeiffer
On Sat, Apr 21, 2012 at 8:21 PM, Anne van Kesteren ann...@opera.com wrote:
 Hey,

 This morning I looked into what it would take to define Encoding Sniffing.
 http://wiki.whatwg.org/wiki/Encoding#Sniffing has links as to what I looked
 at (minus Opera internal). As far as I can tell Gecko has the most
 comprehensive approach and should not be too hard to define (though writing
 it all out correctly and clearly will be some work).

 I have some questions though:

 1) Is this something we want to define and eventually implement the same
 way?
 2) Does this need to apply outside HTML? For JavaScript it is forbidden per the
 HTML standard at the moment. CSS and XML do not allow it either. Is it used
 for decoding text/plain at the moment?

We've had some discussion on the usefulness of this in WebVTT - mostly
just in relation to HTML, though I am sure that stand-alone video
players that decode WebVTT would find it useful, too.

Cheers,
Silvia.

 3) Is there a limit to how many bytes we should look at?

 Thanks,


 --
 Anne van Kesteren
 http://annevankesteren.nl/