Philip Taylor wrote:
...
Indeed, it would be good to have this defined with the level of precision that HTML 5 has, so we can be sure implementations will be able to agree on how to extract RDFa from text/html content.

A few significant issues that I see in the current version:

What is "the @xml:lang attribute"? Is it the attribute with local name

It's unambiguous as long as we talk about a stream of characters, right?

"xml:lang" in no namespace (as would be produced by an HTML 5 parser (and by current HTML browser parser implementations))? Or the attribute with local name "lang" in the namespace "http://www.w3.org/XML/1998/namespace" (as would be produced by an XML parser, and could be inserted in an HTML document via DOM APIs)? Or both (in which case both could be specified on one element, in addition to "lang" in no namespace)?

Both can only be specified in the DOM, but not in a serialization (or am I missing something?).
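
To make the two readings concrete, here is a minimal sketch using only the Python standard library. It assumes Python's html.parser tokenizer as a rough stand-in for an HTML parser (it is not an HTML 5 parser and builds no tree) and xml.etree.ElementTree as the XML parser. The helper names html_attrs and xml_attrs are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class AttrNames(HTMLParser):
    """Records (name, value) attribute pairs exactly as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.extend(attrs)


def html_attrs(markup):
    """Attribute pairs from an HTML-style tokenizer (hypothetical helper)."""
    parser = AttrNames()
    parser.feed(markup)
    parser.close()
    return parser.seen


def xml_attrs(markup):
    """Attribute names from a namespace-aware XML parser (hypothetical helper)."""
    return dict(ET.fromstring(markup).attrib)


MARKUP = '<p xml:lang="en" lang="en">Test</p>'

# The HTML-style tokenizer keeps "xml:lang" as one opaque name, no namespace.
print(html_attrs(MARKUP))
# The XML parser expands the prefix: local name "lang" in the XML namespace.
print(xml_attrs(MARKUP))
```

The same characters therefore yield differently named attributes depending on which parser reads them, which is the distinction the question is about.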

That being said, it wouldn't hurt to have a section that defines special aspects of processing RDFa from a DOM instead of an HTML document (as a series of bytes/characters).

"If the object of a triple would be an XMLLiteral, and the input to the processor is not well-formed [XML]" - I don't understand what that means in an HTML context. Is it meant to mean something like "the bytes in the HTML file that correspond to the contents of the relevant element could be parsed as well-formed XML (modulo various namespace declaration issues)"? If so, that seems impossible to implement. The input to the RDFa processor will most likely be a DOM, possibly manipulated by the DOM APIs rather than coming straight from an HTML parser, so it may never have had a byte representation at all.

Even without scripting, there isn't always a contiguous sequence of bytes corresponding to the content of an element. E.g. if the HTML input is:
  <table>
    <tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
      <td> This text goes inside the table </td>
      This text gets parsed to *outside* the table
      <td> This text goes inside the table </td>
    </tr>
  </table>
then (according to the HTML 5 parsing algorithm, as implemented in at least Firefox) the content of the <tr> element includes the first and third lines of text, but not the second. How would you decide whether the content is well-formed XML?

Is it still underspecified once we require a valid HTML5 document as input?

For this to make sense in real HTML implementations, the definition should be in terms of the document layer rather than the byte layer.

Disagreed. Many implementations never build a DOM. We're not only talking about browsers here.

...
How are xmlns:* attributes meant to be processed? E.g. what is the expected output in the following cases:

<div xmlns:T="test:">
  <span typeof="t:x" property="t:y">Test</span>
</div>

<div XMLNS:t="test:">
  <span typeof="t:x" property="t:y">Test</span>
</div>

<div xmlns:T="test:">
  <span typeof="T:x" property="T:y">Test</span>
</div>

<div xmlns:t="test:">
  <div xmlns:t="">
    <span typeof="t:x" property="t:y">Test</span>
  </div>
</div>

I would expect the results to be the same for XHTML and HTML serializations.
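
As a rough illustration of why the first four cases are not obvious, here is a sketch that runs them through Python's html.parser tokenizer. This is an assumption-laden stand-in: html.parser is not an HTML 5 parser, and the class PrefixScan and function scan are hypothetical names, not part of any RDFa processor. Like an HTML tokenizer, it lowercases attribute names but leaves attribute values untouched.

```python
from html.parser import HTMLParser


class PrefixScan(HTMLParser):
    """Records xmlns:* declarations and CURIE prefixes as tokenized (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.declared = []  # (prefix, uri) pairs in document order
        self.used = []      # prefixes found in @typeof / @property values

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("xmlns:"):
                self.declared.append((name[len("xmlns:"):], value))
            elif name in ("typeof", "property") and value and ":" in value:
                self.used.append(value.split(":", 1)[0])


def scan(markup):
    """Return (declared prefix mappings, prefixes used) for a markup string."""
    parser = PrefixScan()
    parser.feed(markup)
    parser.close()
    return parser.declared, parser.used


SPAN_LOWER = '<span typeof="t:x" property="t:y">Test</span>'
SPAN_UPPER = '<span typeof="T:x" property="T:y">Test</span>'

case1 = scan('<div xmlns:T="test:">' + SPAN_LOWER + '</div>')
case2 = scan('<div XMLNS:t="test:">' + SPAN_LOWER + '</div>')
case3 = scan('<div xmlns:T="test:">' + SPAN_UPPER + '</div>')
case4 = scan('<div xmlns:t="test:"><div xmlns:t="">' + SPAN_LOWER + '</div></div>')

for declared, used in (case1, case2, case3, case4):
    print(declared, used)
```

Under these assumptions the first three cases all declare the prefix "t", but the third case uses "T" in its CURIEs, so whether it resolves depends on whether prefixes are compared case-sensitively. An XML parser would instead treat xmlns:T and xmlns:t as distinct declarations, which is one way the two serializations could diverge.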

<div xmlns:t="test1:" id="d">
  <span typeof="t:x" property="t:y">Test</span>
</div>
<script>
  document.getElementById('d').setAttributeNS(
    'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
    /* (now the element has two distinct attributes,
       each in different namespaces) */
</script>

That example illustrates why it's dangerous to focus too much on processing in the DOM. Many RDFa processors will never execute the script. So I think considerations like the one above should be treated as a distinct problem (potentially in an appendix of the spec).

...

BR, Julian
