Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Ian Hickson
On Tue, 24 Jan 2006, Lachlan Hunt wrote:
 
 As for how to parse it, I'll use these test cases to demonstrate what I 
 consider to be the most sane way to handle comments.  (Assume EOF at the 
 end of each one)
 
 Test Case  | Comment Content  | Output
 ---|--|--
 PA!SS|| PASS
 PA! -SS  |  - | PASS
 PA! --SS || PASS
 PA!-SS   | -  | PASS
 PA!- -SS | - -| PASS
 PA!- -SS -- | - -| PASS --

Agreed.


 PA!- !--SS --  | - !   | PASS --

Comment should be - !-- IMHO. It's still a bogus comment (in HTML5 
nomenclature), the -- part is irrelevant.


 PA!- !-- -SS --| - !-- -   | PASS --

Agreed.


 PA!- --SS| -  | PASS
 PA!- -- SS   | -  | PASS

These are bogus comments, so again, they should be - -- and - --  
respectively, IMHO.


 PA!-- FAIL --SS  |  FAIL  | PASS
 PA!-- FAIL --SS |  FAIL | PASS
 PA!-- FAIL !-- --SS|  FAIL !--| PASS
 PA!-- FAIL !-- -- --SS |  FAIL !-- -- | PASS

Agreed.


 PA!--  FAIL -- SS   |   FAIL| PASS

Disagree. The terminator should be "-->", not "--" S* ">". I don't see any 
good reason to have "--" S* ">".


 P!-- -- AS!-- --S  |   (2 comments) | PASS

Disagree (same reason). " -- >AS<!-- " is the comment, output is "PS".


 PA!-- FAIL -- FAIL --SS  |  FAIL -- FAIL  | PASS
 P!-- -- --AS!-- -- --S |  --  (2 comments)  | PASS
 PA!-- -- -- --SS |  -- -- | PASS
 PA!-- FAIL -- FAIL -- FAIL --SS  |  FAIL -- FAIL -- FAIL  | PASS
 PA!--- FAIL --SS | - FAIL | PASS
 PA!--- FAIL ---SS| - FAIL -   | PASS
 !-- -FAIL|  -FAIL|
 !--- -FAIL   | - -FAIL   |
 PA!-SS  | - | PASS

Agreed.


 !-- --- -| (not sure)   |

Comment text is  --- -.


 PA!-- --- --SS   |  ---   | PASS
 PA!--- --- ---SS | - --- -| PASS

Agreed.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Lachlan Hunt

Ian Hickson wrote:

On Tue, 24 Jan 2006, Lachlan Hunt wrote:

PA!- !--SS --  | - !   | PASS --


Comment should be - !-- IMHO. It's still a bogus comment (in HTML5 
nomenclature), the -- part is irrelevant.


Ok, so if a comment only starts with '<!' then it ends at the first '>' 
only (ignoring any '--'), but if a comment starts with '<!--' then it 
must end with '-->'.



PA!- --SS| -  | PASS
PA!- -- SS   | -  | PASS


These are bogus comments, so again, they should be - -- and - --  
respectively, IMHO.


Ok.


PA!--  FAIL -- SS   |   FAIL| PASS


Disagree. The terminator should be "-->", not "--" S* ">". I don't see any 
good reason to have "--" S* ">".


I was working on the assumption that the comment would end at the first 
occurrence of '>' while in the comment end state, but that whitespace 
would be ignored while searching for it.  Several browsers already 
handle it like that including Mozilla, Opera and Safari (except in 
Opera, the comment contained   FAIL -).  Although IE, OmniWeb and 
iCab failed.
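
The two candidate terminators discussed here can be compared with a small sketch. It assumes the test case reads "PA<!--  FAIL -- >SS" (the angle brackets are a reconstruction), and the names STRICT, LENIENT and strip_comments are made up for illustration. It is not the HTML5 algorithm: in particular, it leaves an unterminated comment in place rather than running it to EOF.

```python
import re

# Strict: a comment ends only at the exact sequence "-->".
STRICT = re.compile(r"<!--(.*?)-->", re.DOTALL)

# Lenient: whitespace is allowed between the closing "--" and ">".
LENIENT = re.compile(r"<!--(.*?)--\s*>", re.DOTALL)


def strip_comments(src: str, pattern: re.Pattern) -> str:
    """Remove every comment matched by the given pattern (illustrative only)."""
    return pattern.sub("", src)


sample = "PA<!--  FAIL -- >SS"
print(repr(strip_comments(sample, STRICT)))   # no "-->" present, nothing removed
print(repr(strip_comments(sample, LENIENT)))  # "-- >" accepted as terminator
```

With the lenient pattern the captured comment text is "  FAIL ", which is the result reported above for the browsers that skip whitespace.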


--
Lachlan Hunt
http://lachy.id.au/



Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Ian Hickson
On Wed, 25 Jan 2006, Lachlan Hunt wrote:

 Ian Hickson wrote:
  On Tue, 24 Jan 2006, Lachlan Hunt wrote:
   PA!- !--SS --  | - !   | PASS --
  
  Comment should be - !-- IMHO. It's still a bogus comment (in HTML5
  nomenclature), the -- part is irrelevant.
 
 Ok, so if a comment only starts with '<!' then it ends at the first '>' only
 (ignoring any '--'), but if a comment starts with '<!--' then it must end with
 '-->'.

Right. They end up in different parse states (bogus comment or bogus 
tag or something, vs comment or something). This is for compatibility 
with existing UAs -- basically it's not a comment really, just a malformed 
tag that happens to be turned into a Comment node in the DOM.
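
A minimal sketch of the two parse states described above, assuming the markup reads "<!" or "<!--" at the given position. The function name read_comment is hypothetical, and edge cases such as "<!-->" are deliberately ignored, so this is an illustration rather than the spec's tokenizer.

```python
def read_comment(src: str, pos: int) -> tuple[str, int]:
    """Return (comment_text, next_pos) for markup starting with '<!' at pos."""
    if not src.startswith("<!", pos):
        raise ValueError("expected '<!' at pos")

    if src.startswith("<!--", pos):
        # Comment state: runs until the first "-->" (or EOF).
        start = pos + 4
        end = src.find("-->", start)
        if end == -1:
            return src[start:], len(src)
        return src[start:end], end + 3

    # Bogus comment state: runs until the first ">" (or EOF);
    # any "--" inside has no special meaning.
    start = pos + 2
    end = src.find(">", start)
    if end == -1:
        return src[start:], len(src)
    return src[start:end], end + 1


text, nxt = read_comment("PA<!- !-->SS -->", 2)
print(repr(text), repr("PA" + "PA<!- !-->SS -->"[nxt:]))
```

For the test case quoted above this yields the comment text "- !--", with "SS -->" left over as ordinary text.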


   PA!--  FAIL -- SS   |   FAIL| PASS
  
  Disagree. The terminator should be "-->", not "--" S* ">". I don't see any
  good reason to have "--" S* ">".
 
 I was working on the assumption that the comment would end at the first 
 occurrence of '>' while in the comment end state, but that whitespace 
 would be ignored while searching for it.  Several browsers already 
 handle it like that including Mozilla, Opera and Safari (except in 
 Opera, the comment contained   FAIL -).  Although IE, OmniWeb and 
 iCab failed.

Really? In my testing, browsers didn't reliably do this. Were you testing 
standards mode or quirks mode? Did you have the potential to be hitting 
unexpected-EOF-reparse behaviour, or was it definitely the first-parse 
behaviour?



Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Lachlan Hunt

Ian Hickson wrote:

On Wed, 25 Jan 2006, Lachlan Hunt wrote:

Ian Hickson wrote:

On Tue, 24 Jan 2006, Lachlan Hunt wrote:

PA!--  FAIL -- SS   |   FAIL| PASS

Disagree. The terminator should be "-->", not "--" S* ">". I don't see any
good reason to have "--" S* ">".
I was working on the assumption that the comment would end at the first 
occurrence of '>' while in the comment end state, but that whitespace 
would be ignored while searching for it.  Several browsers already 
handle it like that including Mozilla, Opera and Safari (except in 
Opera, the comment contained   FAIL -).  Although IE, OmniWeb and 
iCab failed.


Really? In my testing, browsers didn't reliably do this. Were you testing 
standards mode or quirks mode? Did you have the potential to be hitting 
unexpected-EOF-reparse behaviour, or was it definitely the first-parse 
behaviour?


I tested the following in the live dom viewer using Firefox 1.5.0.1 Win 
and Mac, Opera 8.5/Mac, Opera 9 Win and Mac, Safari 2.0.3, IE6, OmniWeb 
5.1.2 and iCab 3.0.1.


!DOCTYPE html
PA!--  FAIL -- SS

Browser   | Comment     | Rendered
----------|-------------|--------------
Firefox   |   FAIL      | PASS
O 8.5/Mac |   FAIL -    | PASS
O 9.0/Mac |   FAIL      | PASS
O 9.0/Win |   FAIL      | PASS
Safari    | (not shown) | PASS
IE6       | (not shown) | PA FAIL -- SS
iCab      | (not shown) | PA FAIL -- SS
OmniWeb   | (not shown) | PA FAIL -- SS

(The live dom viewer didn't work for OmniWeb, I just used an HTML file 
instead)


--
Lachlan Hunt
http://lachy.id.au/



Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Lachlan Hunt

Ian Hickson wrote:

On Wed, 25 Jan 2006, Lachlan Hunt wrote:
I tested the following in the live dom viewer using Firefox 1.5.0.1 Win 
and Mac, Opera 8.5/Mac, Opera 9 Win and Mac, Safari 2.0.3, IE6, OmniWeb 
5.1.2 and iCab 3.0.1.


!DOCTYPE html
PA!--  FAIL -- SS


This triggers SGML comment parsing mode (which you don't want to be 
testing) in a number of browsers.


Why?  The closer we can define the behaviour to be compatible with 
existing standards mode behaviours, the better it will be for backwards 
compatibility?


--
Lachlan Hunt
http://lachy.id.au/



Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Ian Hickson
On Wed, 25 Jan 2006, Lachlan Hunt wrote:

 Ian Hickson wrote:
  On Wed, 25 Jan 2006, Lachlan Hunt wrote:
   I tested the following in the live dom viewer using Firefox 1.5.0.1 Win
   and Mac, Opera 8.5/Mac, Opera 9 Win and Mac, Safari 2.0.3, IE6, OmniWeb
   5.1.2 and iCab 3.0.1.
   
   !DOCTYPE html
   PA!--  FAIL -- SS
  
  This triggers SGML comment parsing mode (which you don't want to be testing)
  in a number of browsers.
 
 Why?  The closer we can define the behaviour to be compatible with existing
 standards mode behaviours, the better it will be for backwards compatibility?

This entire discussion started from the developers of all the browsers who 
implemented the SGML comment mode coming to me and telling me I was stupid 
for even suggesting that this is how comments should be parsed. The whole 
point of all this is to simplify comment parsing.



Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Anne van Kesteren

Quoting Lachlan Hunt:
This entire discussion started from the developers of all the 
browsers who implemented the SGML comment mode coming to me and 
telling me I was stupid for even suggesting that this is how 
comments should be parsed. The whole point of all this is to 
simplify comment parsing.


Yes, and I agree with that.  But, besides Mozilla, which of those 
browser versions that I tested actually have SGML comments enabled?


Opera 9 I assume. If I remember correctly the SGML thing was fixed before the
preview. We currently plan on going back to normal comment handling for the
moment. So you could use Opera 8.5 if you do not want SGML comment handling.


--
Anne van Kesteren
http://annevankesteren.nl/



Re: [whatwg] Comment Syntax and Parsing

2006-01-24 Thread Håkon Wium Lie
Also sprach Ian Hickson:

This triggers SGML comment parsing mode (which you don't want to be
testing) in a number of browsers.
   
   Why?  The closer we can define the behaviour to be compatible with
   existing standards mode behaviours, the better it will be for
   backwards compatibility?
  
  This entire discussion started from the developers of all the browsers who 
  implemented the SGML comment mode coming to me and telling me I was stupid 
  for even suggesting that this is how comments should be parsed. The whole 
  point of all this is to simplify comment parsing.

Right. And since I run out of memory trying to parse a sentence with
the word simple and SGML in it...

Oops. Core dumped.

-hkon




Re: [whatwg] Comment Syntax and Parsing

2006-01-23 Thread Henri Sivonen

On Jan 23, 2006, at 05:23, Ian Hickson wrote:

Probably the same as XML. Or maybe just "<!--" followed by zero or more
characters other than U+, followed by "-->".


Of those two choices, I prefer the former. I don't like the idea of  
expanding the set of conforming comments, because I think having  
conforming comments should maximize the backwards-compatibility of  
the comments (and there are browsers in the wild that implement SGML- 
style comments, which is incompatible with the latter alternative  
above).


I think allowing paired double hyphens with whitespace in between and  
allowing whitespace between the ending "--" and ">" would make sense.  
This would improve the source-level upgradeability of valid HTML 4 to  
conforming HTML 5. However, it would have the old confusion issues.


!-- I think this should be conforming. --
!-- Making --   -- this conforming would make sense as well. --   
!-- IMO, this -- should not be conforming but should parse  
unambiguously with an easy parse error. --
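
For reference, the XML-style rule preferred above (no "--" inside the comment body, and no "-" immediately before the closing "-->") can be sketched as a conformance check. The function name is hypothetical, and the sample strings in the usage lines are reconstructions, since the angle brackets did not survive in the examples above.

```python
def is_xml_style_comment(markup: str) -> bool:
    """True if markup is a single comment that XML 1.0 would accept."""
    # Shortest possible comment is "<!---->" (7 characters).
    if len(markup) < 7:
        return False
    if not (markup.startswith("<!--") and markup.endswith("-->")):
        return False
    body = markup[4:-3]
    # XML forbids "--" in the body and a body ending in "-".
    return "--" not in body and not body.endswith("-")


print(is_xml_style_comment("<!-- I think this should be conforming. -->"))
print(is_xml_style_comment("<!-- IMO, this -- should not be conforming -->"))
```

This only answers the conformance question; how a parser should recover from a non-conforming comment is a separate matter.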


--
Henri Sivonen
http://hsivonen.iki.fi/




Re: [whatwg] Comment Syntax and Parsing

2006-01-23 Thread Anne van Kesteren

Quoting Henri Sivonen:
I think allowing paired double hyphens with whitespace in between  
and allowing whitespace between the ending "--" and ">" would make  
sense. This would improve the source-level upgradeability of valid  
HTML 4 to conforming HTML 5. However, it would have the old  
confusion issues.


!-- I think this should be conforming. --
!-- Making --   -- this conforming would make sense as well. --   
!-- IMO, this -- should not be conforming but should parse  
unambiguously with an easy parse error. --


And then it would be necessary to make this one non-conforming:
!-- In comment --  --  Not in HTML 5 comment but in SGML comment --

I guess the XML style is the simplest thing that could work. :-/


You are talking about conformance, but what do you want the parser to do? And
also there is talk about whitespace between "--" and ">" but currently all
kinds of characters are allowed there (including "-" for instance).


--
Anne van Kesteren
http://annevankesteren.nl/



Re: [whatwg] Comment Syntax and Parsing

2006-01-23 Thread Ian Hickson
On Mon, 23 Jan 2006, Lachlan Hunt wrote:
 
 Well that depends on the implementation and how SGML defines that such 
 erroneous comments be handled.

Indeed, there is that too. Whatever behaviour we require will be, to some 
extent, new behaviour.


 (Without a copy of ISO 8879 handy, it's difficult to check, so the 
 following is based purely on observing the implementations.)

ISO 8879:1986 (including its 1996 and 1998 annexes) doesn't cover, as far 
as I can tell, error handling requirements for parsers.


 Do you know if browsers will be using this for both standards and quirks 
 mode or will they retain their existing quirks mode parsing and use this 
 as the new standards mode parsing only?

I imagine that any changes to quirks mode handling will be done very 
carefully over an extended period of time.


 Well, many authors believe they're using XHTML, and many even believe they're 
 using the correct XHTML MIME Type (using <meta>), even though they're 
 not.  So, regardless of whether they actually are or not, they're going 
 to believe they are and it's best not to confuse them more by saying:

! isn't well-formed XML

Fair enough. I've made it a parse error (which is what determines what 
conformance checkers must say regarding valid vs invalid syntax).


 ...have them come back and say:
the validator says it's fine
 
 and then tell them:
   that's because the document isn't XHTML.
 
 only to hear:
   Yes it is, look at the meta element and all these slashes (<br/>)

<br/> will also be flagged as a parse error, for what it's worth.


On Mon, 23 Jan 2006, Henri Sivonen wrote:

 [...]

By the way, Henri, thanks for your comments a few months back about 
parsing. I've been using them, and have agreed and implemented most of 
them in the spec so far. I'll reply to them in more detail in due course.


 I think allowing paired double hyphens with whitespace in between [would 
 make sense]

That seems like excessive complexity for conformance checkers, with very 
little benefit (beyond the theoretical).


 and allowing whitespace between the ending "--" and ">" would make 
 sense.

This also seems a little gratuitous.


 This would improve the source-level upgradeability of valid HTML 4 to 
 conforming HTML 5. However, it would have the old confusion issues.

I think those issues outweigh the benefits you mention.


 I guess the XML style is the simplest thing that could work. :-/

I agree. :-)



Re: [whatwg] Comment Syntax and Parsing

2006-01-23 Thread Henri Sivonen

On Jan 23, 2006, at 11:39, Anne van Kesteren wrote:


I guess the XML style is the simplest thing that could work. :-/


You are talking about conformance, but what do you want the parser  
to do?


I talked about conformance, because I'd prefer document conformance  
be defined in such a way that conforming comments maximize  
compatibility with different parsers.


I did not say anything about how I want non-conforming comments to be  
handled, because I think Hixie has researched the issue so much more  
than I that I don't have anything educated enough to say right now.


--
Henri Sivonen
http://hsivonen.iki.fi/




Re: [whatwg] Comment Syntax and Parsing

2006-01-22 Thread Ian Hickson
On Mon, 23 Jan 2006, Lachlan Hunt wrote:
 
 Well, for what it's worth, I still don't think you were being stupid, I think
 you were right all along and had this been implemented by more than just
 Mozilla 7 years ago, the result may have been different.

Authors find the -- thing unbelievably confusing.

Why does:

  !-- Hello
-- World
-- How does comment work?
-- I don't know.
-- Do you?
--

...work, but this:

  !-- Hello World
-- How does comment work?
-- I don't know.
-- Do you?
--

...or this:

  !-- Hello
-- World
-- How does comment work?
-- I don't know. Do you?
--

...not? Authors just don't get it.

It makes more sense when you have draconian error handling, but HTML 
doesn't.


 [...] all of those vendors have unanimously voted against implementing 
 proper comment handling in favour of quirks-mode-style parsing, there 
 really isn't a choice in the matter.

(What HTML5 says isn't really quirks mode comment parsing, it's even 
simpler.)


  Probably the same as XML. Or maybe just "<!--" followed by zero or 
  more characters other than U+, followed by "-->".
 
 I vote for keeping it very similar to XML, it'll be easier for authors 
 only having to learn and remember one comment syntax.

Plus CSS's. Plus Javascript's. So three syntaxes, at least.

...and this is assuming they'll ever use XML.


  Yeah. The question is do we really want to confuse people by telling 
  them that their comment is invalid when they write:
  
 !-
 
 Yes, for backwards compatibility reasons.

Fair enough. We can always allow it later.


 Another question is, do we wish to continue allowing white space like this:
 !-- comment --   
 
 I believe it's supported by all browsers without any difficulty

Actually, it isn't. In most browsers that I tested the above gets treated 
as an unclosed comment which is then re-parsed in "close at first >" mode. 
Since we're dropping the re-parse mode (see earlier mails), this goes away 
with it.

You can test whether or not it's really supported by comparing these:

   !--  -- -- EOF
   !--  --  -- EOF
   !--  -- EOF
   !--  --  EOF

...in my script:

   http://software.hixie.ch/utilities/js/live-dom-viewer/



Re: [whatwg] Comment Syntax and Parsing

2006-01-22 Thread Lachlan Hunt

Ian Hickson wrote:

On Mon, 23 Jan 2006, Lachlan Hunt wrote:

Well, for what it's worth, I still don't think you were being stupid, I think
you were right all along and had this been implemented by more than just
Mozilla 7 years ago, the result may have been different.


Authors find the -- thing unbelievably confusing.


Oh, yes, absolutely.  I know, I've tried explaining it to some with 
varying degrees of success.



Why does:

  !-- Hello
-- World
-- How does comment work?
-- I don't know.
-- Do you?
--

...work,


Well that depends on the implementation and how SGML defines that such 
erroneous comments be handled.  (Without a copy of ISO 8879 handy, it's 
difficult to check, so the following is based purely on observing the 
implementations.)


Mozilla will handle that entirely as a single comment, which is closed 
at the occurrence of "-->" at the end.


onsgmls, however, (which is more likely to be closer to the SGML spec) 
will encounter the 'W' in 'World', which is outside of the comment, 
treat it as an erroneous unclosed comment declaration and implicitly 
close it.  It will then drop the 'W' completely and continue on, 
treating comment as an unknown and unclosed element along the way 
(assuming an HTML doctype is used).


So, basically, none of those examples actually work, they just appear 
to work in some implementations.


(What HTML5 says isn't really quirks mode comment parsing, it's even 
simpler.)


Ok, well then I don't have a clue how quirks mode parsing works, it's 
just too unpredictable.  I'm glad this is going to be simpler.  Do you 
know if browsers will be using this for both standards and quirks mode 
or will they retain their existing quirks mode parsing and use this as 
the new standards mode parsing only?


Probably the same as XML. Or maybe just "<!--" followed by zero or 
more characters other than U+, followed by "-->".
I vote for keeping it very similar to XML, it'll be easier for authors 
only having to learn and remember one comment syntax.


Plus CSS's. Plus Javascript's. So three syntaxes, at least.


Yes, but authors don't confuse CSS and JavaScript as being the same 
language as HTML as often as they confuse HTML and XHTML as being the same.



...and this is assuming they'll ever use XML.


Well, many authors believe they're using XHTML, and many even believe they're 
using the correct XHTML MIME Type (using <meta>), even though they're 
not.  So, regardless of whether they actually are or not, they're going 
to believe they are and it's best not to confuse them more by saying:

   ! isn't well-formed XML

and have them come back and say:
   the validator says it's fine

and then tell them:
  that's because the document isn't XHTML.

only to hear:
  Yes it is, look at the meta element and all these slashes (<br/>)


Another question is, do we wish to continue allowing white space like this:
!-- comment --   

I believe it's supported by all browsers without any difficulty


Actually, it isn't. In most browsers that I tested the above gets treated 
as an unclosed comment which is then re-parsed in "close at first >" mode.


You're right, but IE was the only browser that I could find which (in 
standards mode) treated it like that.


Since we're dropping the re-parse mode (see earlier mails), this goes away 
with it.


OK.

--
Lachlan Hunt
http://lachy.id.au/