Philip Taylor wrote:
Indeed, it would be good to have this defined with the level of precision
that HTML 5 has, so we can be sure implementations will be able to
agree on how to extract RDFa from text/html content.
A few significant issues that I see in the current version:
What is "the @xml:lang attribute"? Is it the attribute with local name
"xml:lang" in no namespace (as would be produced by an HTML 5 parser
(and by current HTML browser parser implementations))? or the
attribute with local name "lang" in the namespace
"http://www.w3.org/XML/1998/namespace" (as would be produced by an XML
parser, and could be inserted in an HTML document via DOM APIs)? or
both (in which case both could be specified on one element, in
addition to "lang" in no namespace)?
Well - remember that the document you are looking at is written in the
context of HTML 4. In HTML 4 none of what you say above makes any
sense. Attributes are tokens - and the token "xml:lang" is what I was
talking about. In HTML 4 those attribute names are case-insensitive -
I need to add something about that to the draft. Thanks for the reminder!
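To make the contrast concrete, here is a small illustrative sketch (Python, purely for discussion, not anything normative): an HTML-style parser hands back "xml:lang" as a flat, case-normalized token, while an XML parser resolves the "xml" prefix to its fixed namespace.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# An HTML parser reports the attribute as a flat, lowercased token
# with no namespace attached to it.
class AttrCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

p = AttrCollector()
p.feed('<p XML:LANG="en">Hello</p>')
print(p.attrs)  # [('xml:lang', 'en')]

# An XML parser resolves the reserved "xml" prefix, so the attribute
# becomes local name "lang" in the XML namespace.
el = ET.fromstring('<p xml:lang="en">Hello</p>')
print(el.attrib)  # {'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
```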
"If the object of a triple would be an XMLLiteral, and the input to
the processor is not well-formed [XML]" - I don't understand what that
means in an HTML context. Is it meant to mean something like "the
bytes in the HTML file that correspond to the contents of the relevant
element could be parsed as well-formed XML (modulo various namespace
declaration issues)"? If so, that seems impossible to implement. The
input to the RDFa processor will most likely be a DOM, possibly
manipulated by the DOM APIs rather than coming straight from an HTML
parser, so it may never have had a byte representation at all.
We have no presumption of how an RDFa processor is implemented. It
might be client side via a browser. It might be server side. It might
be part of an XML tool-chain. It doesn't really matter. In this case,
the document I wrote is a little too fuzzy because the idea is not
completely cooked yet. Here's the problem: RDFa permits the creation
of objects that are of type XMLLiteral. That datatype is tightly
defined, and as you can imagine it is expected to contain well-formed
XML. If a Conforming RDFa Processor were to generate triples that
contained data of type XMLLiteral, and that data were not "well-formed"
as defined in XML, then consumers of that data could easily be very
surprised!
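One way an implementation might guard against that (a sketch of the idea only, not something the draft mandates) is to attempt to parse the candidate literal as an XML fragment before emitting the triple, and drop or repair it on failure:

```python
import xml.etree.ElementTree as ET

def is_well_formed_fragment(s):
    """Return True if s parses as a well-formed XML fragment.
    Wrapping in a dummy root allows multiple top-level nodes.
    (Namespace-prefix declaration issues are ignored here.)"""
    try:
        ET.fromstring('<wrap>' + s + '</wrap>')
        return True
    except ET.ParseError:
        return False

print(is_well_formed_fragment('Some <em>marked up</em> text'))  # True
print(is_well_formed_fragment('An unclosed <br> tag'))          # False
```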
Even without scripting, there isn't always a contiguous sequence of
bytes corresponding to the content of an element. E.g. if the HTML
input is:
<table>
<tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
<td> This text goes inside the table </td>
This text gets parsed to *outside* the table
<td> This text goes inside the table </td>
</tr>
</table>
then (according to the HTML 5 parsing algorithm, and implemented in
(at least) Firefox) the content of the <tr> element includes the first
and third lines of text, but not the second. How would you decide
whether the content is well-formed XML?
Yeah - tricky. I think you need to take a step back and think about
goals rather than implementation strategies. The goal here is that all
implementations extract the same collection of triples from a given
document. There are a lot of ways to achieve that. In the XHTML
profile of RDFa we relied upon the XML parsing model. Consequently, we
are confident that we are being handed well-formed content. If you do
an implementation via the DOM, you can also be confident of that, since
by the time content gets into the DOM you can assume the processor has
done whatever magic was necessary and you have a node tree that you
could turn back into content that would be well-formed. If you are
writing your own parser that sifts through a document character by
character... well, you are going to have some work ahead of you!
With regard to your example above.... if I had a DOM based processor, I
would have missed out on line 3 I imagine. If I wrote my own I would
have included it ('cause that is well-formed - the XML parser would have
handed it to me). In the XHTML profile we (sort of) address this in
that we only tightly constrain behavior for *valid* content. The
content above is *invalid* according to the XHTML+RDFa schema - so while
the behavior of existing implementations might be inconsistent, I
personally won't get too excited about it.
In the HTML profile of RDFa, things are much the same. We can attempt
to be very very precise about how the parsing of the content should be
handled, or we can rely upon the parsing model spelled out by the
underlying specification (HTML 4 in this case). Now, I am sure you will
agree that HTML 4 does a pretty poor job of defining the parsing model,
but.... is it adequate for our needs in this instance? My belief is
that it is adequate - at least for the vast majority of the RDFa
processing rules. In particular, most implementors will rely upon
existing parsing libraries, and the problems associated with that
parsing have largely been sorted out over the years - even to the point
that they are being codified in the early draft HTML5 documents.
The only place I have a concern is with regard to creating XMLLiterals.
This is a very powerful aspect of RDFa, and I am loath to disable it in
the HTML profile if I don't have to. Instead, I would like to identify
a light-weight model that implementors can use. For example, we could
say that if an object is of type XMLLiteral, then its content is escaped
so that there is no markup (< becomes &lt;, etc.). This would mean that it is
"well formed XML", and that it could be turned back into its original
source form, which is the goal of such content. However, it would also
mean that a consumer of such content would need to know this and do the
reverse transformation before using the content. I don't know what the
right answer is - maybe we can figure it out together?
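A rough sketch of that escaping idea (the exact escaping rule is undecided; this just uses the standard XML character escapes to illustrate the round trip):

```python
from xml.sax.saxutils import escape, unescape

source = 'Fragment with a <b>bold</b> bit & an entity'

# Producer side: escape the markup so the literal is trivially
# well-formed XML (no elements, just character data).
literal = escape(source)
print(literal)  # Fragment with a &lt;b&gt;bold&lt;/b&gt; bit &amp; an entity

# Consumer side: reverse the transformation to recover the original
# source form before using the content.
print(unescape(literal) == source)  # True
```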
For this to make sense in real HTML implementations, the definition
should be in terms of the document layer rather than the byte layer.
(The XMLLiteral should be an XML-fragment serialisation of the
element, and some error handling (like ignoring the triple) would
occur if it's impossible to serialise as XML, similar to the
requirements in
<http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>)
In HTML 5, where there is an XML serialisation method, that might make
sense. In HTML 4 however, we don't have that luxury. I suppose we
could say that the HTML 4 content is transformed into corresponding
XHTML 1.0 content... but there are really no reliable serializers out
there that do that.
How are xmlns:* attributes meant to be processed? E.g. what is the
expected output in the following cases:
<div xmlns:T="test:">
<span typeof="t:x" property="t:y">Test</span>
</div>
<div XMLNS:t="test:">
<span typeof="t:x" property="t:y">Test</span>
</div>
<div xmlns:T="test:">
<span typeof="T:x" property="T:y">Test</span>
</div>
<div xmlns:t="test:">
<div xmlns:t="">
<span typeof="t:x" property="t:y">Test</span>
</div>
</div>
<div xmlns:t="test1:" id="d">
<span typeof="t:x" property="t:y">Test</span>
</div>
<script>
document.getElementById('d').setAttributeNS(
'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
/* (now the element has two distinct attributes,
each in different namespaces) */
</script>
I had not thought about this much before. Attribute names in HTML /
SGML are case-insensitive. CURIE prefix names are of course NOT.
However, I can almost guarantee you that browser-based implementations
of the XHTML profile right now would fail to work correctly when faced
with CURIE prefixes that differ only in case. Interesting point - I am
going to test that later.
I think we would be wise to advise document authors to not define
prefixes that differ only in case. And in the HTML profile I think it
would be reasonable to require that prefix names are mapped to
lower-case during processing. Or some other solution that gets us to
the point where a browser-based implementation that requests attribute
names from a DOM node can still work. My conclusion here is that prefix
names should be treated case-insensitively in the HTML profile. Do you
agree?
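A sketch of that lower-casing rule (illustrative only - the function names and the exact behavior are my assumptions, not settled spec text): fold both the xmlns token and the declared prefix to lower case when building the mapping, and fold the CURIE prefix the same way when resolving.

```python
def collect_prefix_mappings(attrs):
    """Build a CURIE prefix map from (name, value) attribute pairs,
    treating the xmlns token and the prefix case-insensitively,
    as proposed for the HTML profile."""
    mappings = {}
    for name, value in attrs:
        lowered = name.lower()
        if lowered.startswith('xmlns:'):
            mappings[lowered[len('xmlns:'):]] = value
    return mappings

def resolve_curie(curie, mappings):
    """Expand prefix:reference against the map, ignoring prefix case.
    Returns None for an undeclared prefix."""
    prefix, _, reference = curie.partition(':')
    base = mappings.get(prefix.lower())
    return base + reference if base is not None else None

# The first two of Philip's cases above then yield the same result:
m = collect_prefix_mappings([('XMLNS:T', 'test:')])
print(resolve_curie('t:x', m))  # test:x
print(resolve_curie('T:y', m))  # test:y
```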
Should the same processing rules be used for documents from both HTML
and XHTML parsers, or would DOM-based implementations need to detect
where the input came from and switch processing rules accordingly? If
there is a difference, what happens if I adoptNode from an XHTML
document into an HTML document, or vice versa?
Err... What's adoptNode? And how are these two documents getting
together? I mean, that's sort of out of scope of an HTML 4 profile for
RDFa. With regard to the first part of the question, I believe the same
processing rules can be used. I have an implementation that does it
now. So do lots of other people. My implementation is DOM based
though, so that makes it relatively simple to have the same rules work.
Thanks for your comments!
--
Shane P. McCarron Phone: +1 763 786-8160 x120
Managing Director Fax: +1 763 786-8180
ApTest Minnesota Inet: sh...@aptest.com