Re: HTML 4 Profile for RDFa

Philip Taylor Sat, 23 May 2009 09:50:39 -0700

Minor correction: I wrote:

In an HTML5 text/html serialisation with no scripting, you can only get the attribute "xml:lang" in no namespace.

which I think is wrong because of foreign content: you can write <div xml:lang=a><svg xml:lang=b></svg></div>, which will result in one attribute called "xml:lang" in no namespace on the div, and one called "lang" in the XML namespace on the svg. (But you can't get both on the same element, unless I'm wrong again.)



Julian Reschke wrote:

Is it still underspecified once we require a valid HTML5 document as input?
Probably not. But I wouldn't consider it acceptable to require a valid document as input - people make mistakes all the time, and I want them to get consistent (and hopefully predictable) RDF triples out of it regardless of what implementation they use, so the specification has to deal precisely with invalid input. See http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2009May/0156.html for an example of someone with precisely this kind of error.
Understood; I just wanted to understand the scope of the problem.

Okay, sure. My original comment was in the context of there not necessarily being a contiguous sequence of characters that corresponds to a parsed element, and I think that's closely related to (perhaps the same as?) the concept of streamability (basically the ability to output SAX events without buffering elements). The current streamability violations come from:


* Content inside <table>/<tr> instead of inside <td>
* Misnested <i>...<p>...</i>...</p>, <i>...<b>...</i>...</b>, etc
* Head content (link, meta, etc) between </head> and <body>
* Multiple <html> or <body> elements (their attributes get merged)
* Content after </body>
* (Can't think of any others)

Those are all parse errors, and a conforming parser is allowed to abort when it sees a parse error, though many of them are quite common in the wild.

In any other case, it seems like it ought to be theoretically possible to find a substring of the document that corresponds to the content of an element, though I may be missing some subtleties. (But current parser implementations don't do that, and I don't think they would willingly do so - they throw away the input stream and all they can do is re-serialise the parsed output.)

By "DOM" I generally mean any kind of tree structure of elements and attributes, either as an explicit data structure (DOM, XOM, ElementTree) or implicit (SAX). Would any RDFa implementation *not* parse the input HTML into that kind of structure and operate over the elements and attributes as distinct objects? (e.g. would they just use regular expressions over the input byte stream? That seems quite infeasible to me...)
Depends on the definition of "tree structure". I've been involved in code that just uses a tokenizer and specialized stack, and implementations like these will not do the re-arranging of elements the HTML5 spec specifies for some kinds of broken input.

If they abort when there are streamability violations, that's fine (and is what the Validator.nu parser's unbuffered SAX output does) - the stream of start/end element events will always be well-nested and will encode a tree structure, and it would be possible to specify DOM-based algorithms that could be easily mapped onto that non-DOM implementation.
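As a rough illustration of the abort-on-violation idea (not how Validator.nu or any conforming HTML5 parser is implemented), here is a minimal Python sketch built on the standard library's html.parser, which is only a tokenizer and not an HTML5 tree builder. The names StrictEventStream, StreamabilityError and stream_events are made up for this example, and the check is deliberately naive: it treats any end tag that doesn't match the innermost open element as fatal, and it knows nothing about optional end tags or the table and head/body cases listed earlier.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class StreamabilityError(Exception):
    """Raised when the event stream could not stay well-nested (hypothetical name)."""


class StrictEventStream(HTMLParser):
    """Emit start/end events without buffering; abort on misnested end tags."""

    def __init__(self):
        super().__init__()
        self.open_elements = []   # the "specialized stack"
        self.events = []          # stands in for SAX callbacks

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in VOID:
            self.events.append(("end", tag))
        else:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if not self.open_elements or self.open_elements[-1] != tag:
            # A real HTML5 parser would re-arrange elements here; we give up.
            raise StreamabilityError(f"misnested </{tag}>")
        self.open_elements.pop()
        self.events.append(("end", tag))


def stream_events(markup):
    parser = StrictEventStream()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Every event list this returns is well-nested by construction, so a tree-based algorithm can be replayed over it; input such as <i>a<p>b</i>c</p> raises StreamabilityError instead of producing a re-arranged tree.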

If they don't abort and instead do some different kind of error handling, then they're not a conforming HTML5 parser, and in that case we've already failed at the goal of getting consistent behaviour.

[...]
That's impossible, at least for now as RDFa-in-XHTML relies on XML-NS-wellformedness (so XMLNS:* would be recognized as namespace declaration, right?).


Hmm, maybe a better example of what I intended is:

  <div xmlns:t="test1:">
    <div xmlns:T="test2:">
      <span property="t:x T:y">Test</span>
    </div>
  </div>

which is well-formed XML and has a clear definition in RDFa-in-XHTML, but the defined behaviour is impossible to reproduce in text/html (because xmlns:t and xmlns:T (and XMLNS:T) are parsed identically by an HTML parser and there's no way to distinguish them afterwards).
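The difference can be seen with nothing but the Python standard library. This is only a sketch: html.parser is not a conforming HTML5 parser, but it lowercases attribute names the same way the HTML5 tokenizer does, which is all that matters here. The helper names html_attributes and xml_prefix_bindings are invented for the example.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MARKUP = (
    '<div xmlns:t="test1:">'
    '<div xmlns:T="test2:">'
    '<span property="t:x T:y">Test</span>'
    '</div></div>'
)


class AttrCollector(HTMLParser):
    """Record every (name, value) attribute pair in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.extend(attrs)


def html_attributes(markup):
    """Attributes as an HTML tokenizer reports them (names lowercased)."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


def xml_prefix_bindings(markup):
    """Namespace prefix bindings as an XML parser reports them (case kept)."""
    source = io.BytesIO(markup.encode("utf-8"))
    return [binding for _event, binding in ET.iterparse(source, events=("start-ns",))]
```

On the markup above, the XML side reports two distinct prefixes, t and T, while the HTML side reports two attributes that are both named xmlns:t; the attribute values (including the "T:y" in property) keep their case, so the CURIE prefix T no longer has anything to match against.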


RDFa-in-text/html could:

* Assume attributes are all treated as lowercase (breaking <div xmlns:T="..." property="T:..."> which works in XHTML);

* Say CURIEs (in both XHTML and HTML) match prefixes case-insensitively (breaking compatibility with current implementations);

* Change text/html parsing to preserve attribute case (breaking compatibility with current parsers);

* Use some other prefix-binding mechanism (in both XHTML and HTML) like prefix="t=... T=..." instead of xmlns:t="..." (breaking current implementations and deployed content, but avoiding the mess of parsing differences between XHTML and HTML).

I can't think of any other solutions, so something is going to break no matter what is chosen.
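To make the first two options concrete, here is a small Python sketch of CURIE prefix lookup. It is an illustration only, not the RDFa processing algorithm: resolve_curie and the two binding tables are hypothetical, with XML_BINDINGS standing for what an XML parser sees in the nested-div example and HTML_BINDINGS for what is left once an HTML parser has lowercased both xmlns attributes and the inner declaration has overridden the outer one.

```python
def resolve_curie(curie, bindings, ignore_case=False):
    """Expand prefix:reference using bindings; return None if the prefix is unbound."""
    prefix, _, reference = curie.partition(":")
    if ignore_case:
        # Option 2: compare prefixes case-insensitively.
        bindings = {p.lower(): uri for p, uri in bindings.items()}
        prefix = prefix.lower()
    uri = bindings.get(prefix)
    return None if uri is None else uri + reference


# In scope on the <span> when parsed as XML: two distinct prefixes.
XML_BINDINGS = {"t": "test1:", "T": "test2:"}

# In scope on the <span> when parsed as text/html: both attributes are
# named xmlns:t, and the inner one wins.
HTML_BINDINGS = {"t": "test2:"}
```

With case-sensitive matching, "T:y" resolves to test2:y under XML_BINDINGS but is unbound under HTML_BINDINGS. With ignore_case=True it resolves under HTML_BINDINGS, but "t:x" then gives test2:x rather than the test1:x that the XML parse produces, so neither option reproduces the XHTML result for this input.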


--
Philip Taylor
pj...@cam.ac.uk
