Minor correction: I wrote:
In an HTML5 text/html serialisation with no scripting, you can only get
the attribute "xml:lang" in no namespace.
which I think is wrong because of foreign content: you can write <div
xml:lang=a><svg xml:lang=b></svg></div>, which will result in one
attribute called "xml:lang" in no namespace on the div, and one called
"lang" in the XML namespace on the svg. (But you can't get both on the
same element, unless I'm wrong again.)
Julian Reschke wrote:
Is it still underspecified once we require a valid HTML5 document as
input?
Probably not. But I wouldn't consider it acceptable to require a valid
document as input - people make mistakes all the time, and I want them
to get consistent (and hopefully predictable) RDF triples out of it
regardless of what implementation they use, so the specification has
to deal precisely with invalid input. See
http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2009May/0156.html
for an example of someone with precisely this kind of error.
Understood; I just wanted to understand the scope of the problem.
Okay, sure. My original comment was in the context of there not
necessarily being a contiguous sequence of characters that corresponds
to a parsed element, and I think that's closely related to (perhaps the
same as?) the concept of streamability (basically the ability to output
SAX events without buffering elements). The current streamability
violations come from:
* Content inside <table>/<tr> instead of inside <td>
* Misnested <i>...<p>...</i>...</p>, <i>...<b>...</i>...</b>, etc.
* Head content (link, meta, etc) between </head> and <body>
* Multiple <html> or <body> elements (their attributes get merged)
* Content after </body>
* (Can't think of any others)
Those are all parse errors, and a conforming parser is allowed to abort
when it sees a parse error, though many of them are quite common in the
wild.
In any other case, it seems like it ought to be theoretically possible
to find a substring of the document that corresponds to the content of
an element, though I may be missing some subtleties. (But current parser
implementations don't do that, and I don't think they would willingly do
so - they throw away the input stream and all they can do is
re-serialise the parsed output.)
By "DOM" I generally mean any kind of tree structure of elements and
attributes, either as an explicit data structure (DOM, XOM,
ElementTree) or implicit (SAX). Would any RDFa implementation *not*
parse the input HTML into that kind of structure and operate over the
elements and attributes as distinct objects? (e.g. would they just use
regular expressions over the input byte stream? That seems quite
infeasible to me...)
Depends on the definition of "tree structure". I've been involved in
code that just uses a tokenizer and specialized stack, and
implementations like these will not do the re-arranging of elements the
HTML5 spec specifies for some kinds of broken input.
If they abort when there are streamability violations, that's fine (and
is what the Validator.nu parser's unbuffered SAX output does) - the
stream of start/end element events will always be well-nested and will
encode a tree structure, and it would be possible to specify DOM-based
algorithms that could be easily mapped onto that non-DOM implementation.
If they don't abort and instead do some different kind of error
handling, then they're not a conforming HTML5 parser, and in that case
we've already failed at the goal of getting consistent behaviour.
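To make the "abort on streamability violation" behaviour concrete, here
is a rough sketch of a tokenizer-plus-stack consumer that stops as soon
as an end tag does not match the innermost open element, rather than
rearranging anything. It uses Python's html.parser purely as a
tokenizer; that module is not a conforming HTML5 parser, and the names
StrictStream and StreamabilityError are made up for illustration. It
also ignores implied end tags and every other HTML5 tree-construction
rule, so it only shows the shape of the idea.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so they are emitted as a
# start/end pair immediately and never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class StreamabilityError(Exception):
    """Raised when the events can no longer be emitted well-nested."""


class StrictStream(HTMLParser):
    """Emit start/end events without buffering; abort on misnesting."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements
        self.events = []   # well-nested (kind, tag) events emitted so far

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in VOID:
            self.events.append(("end", tag))
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # already closed when the start tag was seen
        if not self.stack or self.stack[-1] != tag:
            # A real HTML5 parser would recover here; this sketch aborts.
            raise StreamabilityError("misnested end tag: </%s>" % tag)
        self.stack.pop()
        self.events.append(("end", tag))


def stream_events(markup):
    """Return the event list, or raise StreamabilityError."""
    parser = StrictStream()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Because every event list this produces is well-nested, it always encodes
a tree, which is the property a DOM-based algorithm would need in order
to be mapped onto such an implementation.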
[...]
That's impossible, at least for now as RDFa-in-XHTML relies on
XML-NS-wellformedness (so XMLNS:* would be recognized as namespace
declaration, right?).
Hmm, maybe a better example of what I intended is:
<div xmlns:t="test1:">
<div xmlns:T="test2:">
<span property="t:x T:y">Test</span>
</div>
</div>
which is well-formed XML and has a clear definition in RDFa-in-XHTML,
but the defined behaviour is impossible to reproduce in text/html
(because xmlns:t and xmlns:T (and XMLNS:T) are parsed identically by an
HTML parser and there's no way to distinguish them afterwards).
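The loss of case can be seen with a small sketch that feeds the same
markup to an XML tokenizer and to an HTML one and lists the attribute
names each reports on the div elements. It assumes Python's standard
library: xml.parsers.expat stands in for an XML parser and html.parser
for an HTML parser. html.parser is not a conforming HTML5 parser, but it
lowercases attribute names in the same way the HTML5 tokenizer does. The
helper names are made up for illustration.

```python
import xml.parsers.expat
from html.parser import HTMLParser

MARKUP = (
    '<div xmlns:t="test1:">'
    '<div xmlns:T="test2:">'
    '<span property="t:x T:y">Test</span>'
    '</div>'
    '</div>'
)


def xml_div_attr_names(markup):
    """Attribute names on <div> elements as an XML parser reports them."""
    names = []

    def start(tag, attrs):
        if tag == "div":
            names.extend(attrs.keys())

    # No namespace separator is given, so xmlns:* attributes are
    # reported as ordinary attributes with their original names.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(markup, True)
    return names


class _DivAttrCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.names.extend(name for name, _value in attrs)


def html_div_attr_names(markup):
    """Attribute names on <div> elements as an HTML tokenizer reports them."""
    collector = _DivAttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.names
```

The XML side keeps xmlns:t and xmlns:T distinct, while the HTML side
reports xmlns:t twice. The property value "t:x T:y" keeps its case in
both, so the prefix "T" no longer has a binding with a matching name.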
RDFa-in-text/html could:
* Assume attributes are all treated as lowercase (breaking <div
xmlns:T="..." property="T:..."> which works in XHTML);
* Say CURIEs (in both XHTML and HTML) match prefixes case-insensitively
(breaking compatibility with current implementations);
* Change text/html parsing to preserve attribute case (breaking
compatibility with current parsers);
* Use some other prefix-binding mechanism (in both XHTML and HTML) like
prefix="t=... T=..." instead of xmlns:t="..." (breaking current
implementations and deployed content, but avoiding the mess of parsing
differences between XHTML and HTML).
I can't think of any other solutions, so something is going to break no
matter what is chosen.
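As a rough sketch of the last option, here is how bindings might be read
from an attribute value instead of from attribute names. Attribute
values keep their case in text/html, so "t" and "T" stay distinct. The
exact syntax is an assumption made only for this example
(whitespace-separated name=URI pairs, written here as
prefix="t=test1: T=test2:"), and the function names are hypothetical;
nothing here is taken from any RDFa specification.

```python
def parse_prefix_attr(value):
    """Parse an assumed 'name=uri name=uri' syntax into a dict.

    Prefix names are kept exactly as written (case-sensitive).
    """
    bindings = {}
    for item in value.split():
        prefix, sep, uri = item.partition("=")
        if sep and prefix:
            bindings[prefix] = uri
    return bindings


def expand_curie(curie, bindings):
    """Expand 'prefix:reference' using case-sensitive prefix matching.

    Returns None when there is no colon or the prefix is unbound.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in bindings:
        return None
    return bindings[prefix] + reference


def expand_property(value, bindings):
    """Expand every CURIE in a whitespace-separated property value."""
    return [expand_curie(curie, bindings) for curie in value.split()]
```

With bindings parsed from "t=test1: T=test2:", the property value
"t:x T:y" expands to test1:x and test2:y whether the attribute arrived
via an XML parser or an HTML one.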
--
Philip Taylor
pj...@cam.ac.uk