Philip Taylor wrote:
...
Indeed, it would be good to have this defined with the level of precision
that HTML 5 has, so we can be sure implementations will be able to agree
on how to extract RDFa from text/html content.
A few significant issues that I see in the current version:
What is "the @xml:lang attribute"? Is it the attribute with local name
"xml:lang" in no namespace (as would be produced by an HTML 5 parser
(and by current HTML browser parser implementations))? or the attribute
with local name "lang" in the namespace
"http://www.w3.org/XML/1998/namespace" (as would be produced by an XML
parser, and could be inserted in an HTML document via DOM APIs)? or both
(in which case both could be specified on one element, in addition to
"lang" in no namespace)?
It's unambiguous as long as we talk about a stream of characters, right?
Both can only be specified in the DOM, but not in a serialization (or am
I missing something?).
That being said, it wouldn't hurt to have a section that defines the special
aspects of processing RDFa from a DOM instead of an HTML document (as a
series of bytes/characters).
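To make the two readings of "@xml:lang" concrete, here is a minimal sketch
using only the Python standard library. It is an illustration, not an RDFa
processor: html.parser stands in for "an HTML tokenizer" (it is not an
HTML 5 parser), xml.etree.ElementTree stands in for "an XML parser", and the
helper names html_attributes and xml_attributes are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AttributeRecorder(HTMLParser):
    """Record the attributes of every start tag as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(dict(attrs))


def html_attributes(markup):
    """Attributes of the first element, read as text/html (hypothetical helper)."""
    recorder = AttributeRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.seen[0]


def xml_attributes(markup):
    """Attributes of the root element, read as XML (hypothetical helper)."""
    return dict(ET.fromstring(markup).attrib)


markup = '<p xml:lang="en" lang="de">Test</p>'
# HTML tokenizer: "xml:lang" is a plain attribute name, no namespace involved.
print(html_attributes(markup))
# XML parser: the same characters become "lang" in the XML namespace.
print(xml_attributes(markup))
```

The same stream of characters yields an attribute named "xml:lang" in one
case and "{http://www.w3.org/XML/1998/namespace}lang" in the other, which is
the distinction a DOM-oriented section would have to spell out.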
"If the object of a triple would be an XMLLiteral, and the input to the
processor is not well-formed [XML]" - I don't understand what that means
in an HTML context. Is it meant to mean something like "the bytes in the
HTML file that correspond to the contents of the relevant element could
be parsed as well-formed XML (modulo various namespace declaration
issues)"? If so, that seems impossible to implement. The input to the
RDFa processor will most likely be a DOM, possibly manipulated by the
DOM APIs rather than coming straight from an HTML parser, so it may
never have had a byte representation at all.
Even without scripting, there isn't always a contiguous sequence of
bytes corresponding to the content of an element. E.g. if the HTML input
is:
<table>
<tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
<td> This text goes inside the table </td>
This text gets parsed to *outside* the table
<td> This text goes inside the table </td>
</tr>
</table>
then (according to the HTML 5 parsing algorithm, and implemented in (at
least) Firefox) the content of the <tr> element includes the first and
third lines of text, but not the second. How would you decide whether
the content is well-formed XML?
Is it still underspecified once we require a valid HTML5 document as input?
For this to make sense in real HTML implementations, the definition
should be in terms of the document layer rather than the byte layer.
Disagreed. Many implementations never build a DOM. We're not only
talking about browsers here.
...
How are xmlns:* attributes meant to be processed? E.g. what is the
expected output in the following cases:
<div xmlns:T="test:">
<span typeof="t:x" property="t:y">Test</span>
</div>
<div XMLNS:t="test:">
<span typeof="t:x" property="t:y">Test</span>
</div>
<div xmlns:T="test:">
<span typeof="T:x" property="T:y">Test</span>
</div>
<div xmlns:t="test:">
<div xmlns:t="">
<span typeof="t:x" property="t:y">Test</span>
</div>
</div>
I would expect the results to be the same for XHTML and HTML serializations.
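The case-sensitivity examples above are where the two serializations can
drift apart before any RDFa rule is applied. A rough sketch, again with the
Python standard library only and with made-up helper names (html_prefixes,
xml_prefixes); html.parser is used as a stand-in for an HTML tokenizer and
is not an HTML 5 parser:

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class PrefixCollector(HTMLParser):
    """Collect xmlns:* attributes as the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.prefixes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # html.parser has already lowercased the attribute name here.
            if name.startswith("xmlns:"):
                self.prefixes.append((name[len("xmlns:"):], value))


def html_prefixes(markup):
    """Prefix declarations seen when reading markup as text/html."""
    collector = PrefixCollector()
    collector.feed(markup)
    collector.close()
    return collector.prefixes


def xml_prefixes(markup):
    """Prefix declarations seen when reading markup as namespace-aware XML."""
    source = io.BytesIO(markup.encode("utf-8"))
    events = ET.iterparse(source, events=("start-ns",))
    return [(prefix, uri) for _event, (prefix, uri) in events]


sample = '<div xmlns:T="test:"><span typeof="t:x" property="t:y">Test</span></div>'
print(html_prefixes(sample))  # prefix arrives lowercased
print(xml_prefixes(sample))   # prefix keeps its original case
```

With this input the HTML side reports the prefix as "t" and the XML side as
"T", so "same results for both serializations" needs an explicit rule about
prefix case rather than falling out of the parsers.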
<div xmlns:t="test1:" id="d">
<span typeof="t:x" property="t:y">Test</span>
</div>
<script>
document.getElementById('d').setAttributeNS(
'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
/* (now the element has two distinct attributes,
each in different namespaces) */
</script>
That example illustrates why it's dangerous to focus too much on
processing in the DOM. Many RDFa processors will never execute the
script. So I think considerations like the one above should be treated
as a distinct problem (potentially in an appendix of the spec).
...
BR, Julian