Re: HTML 4 Profile for RDFa

Philip Taylor Thu, 14 May 2009 16:59:41 -0700

Shane McCarron wrote:

Philip Taylor wrote:
Indeed, it would be good to have this defined with the level of precision that HTML 5 has, so we can be sure implementations will be able to agree on how to extract RDFa from text/html content.
[...]
Well - remember that the document you are looking at is written in the context of HTML 4. In HTML 4 none of what you say above makes any sense. Attributes are tokens - and the token "xml:lang" is what I was talking about.

Yeah, I'm not sure what else you could do in the context of HTML 4. I'm approaching this from the context of HTML 5 - I think it would be valuable to define precisely the mapping from text/html to RDF triples, so that people can know what to expect when they run their content through any RDFa-aware tool, and it only seems to be feasible to define that in the HTML 5 context.

(This might be in addition to an HTML 4 extension like in your document, not necessarily a replacement, but I'm not personally interested in working with HTML 4. Maybe that means the "HTML 4 Profile for RDFa" thread is not the best place to discuss this, but better here than nowhere...)

[Stuff about XMLLiterals]
We have no presumption of how an RDFa processor is implemented. It might be client side via a browser. It might be server side. It might be part of an XML tool-chain. It doesn't really matter.

Is there any implementation that is *not* based on some kind of abstract document model (like DOM or SAX or some custom tree structure, where documents are parsed into elements and attributes before any further processing)?

It seems to me that requiring the abstract document model to be re-serialised into well-formed XML (regardless of whether it originated from an XML parser, or from parsing HTML with missing quotes and unclosed <br>s, or from a DOM API, or anywhere else) would be the best way to ensure correctness (since the output will always be well-formed XML, by definition), functionality (since it would let you use XMLLiterals in text/html with few surprises or special cases), and practical implementability (since everyone should already have a tree of elements and attributes and be able to serialise it into XML). But that does rely on the concept of a document model, which only really exists in HTML 5 and not in HTML 4.
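As a rough illustration of that re-serialisation idea, the sketch below parses sloppy HTML into a tree and writes the tree back out as well-formed XML. It is an assumption-laden toy, not the HTML 5 parsing algorithm: the names SimpleTreeBuilder, VOID_ELEMENTS, and html_to_xml are hypothetical, the tree builder only knows about a few void elements, and the "fragment" wrapper element is invented purely so the output has a single root.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of elements that never take an end tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class SimpleTreeBuilder(HTMLParser):
    """Toy tree builder: a stand-in for a real HTML parser's document model."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("fragment")  # invented wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        element = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close every element up to the nearest matching open one;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(markup):
    """Parse markup with the toy builder and serialise the tree as XML."""
    builder = SimpleTreeBuilder()
    builder.feed(markup)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


# Missing quotes, an unclosed <br>, and an unclosed <b>.
xml_text = html_to_xml("<p class=note>one<br>two <b>bold</p>")
reparsed = ET.fromstring(xml_text)  # succeeds only if the output is well-formed
```

Because the XML is produced by serialising a tree rather than by echoing the input tokens, the result can be handed to any XML parser, which is the property an XMLLiteral needs.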

I think you need to take a step back and think about goals rather than implementation strategies. The goal here is that all implementations extract the same collection of triples from a given document.

I like that goal :-). I don't want to limit things to a single implementation strategy (e.g. DOM) - but some people will use that implementation strategy, and if other implementations are required to extract the same collection of triples, then it seems sensible to define the requirements in a way that can be easily mapped onto that implementation strategy (and preferably onto others), rather than leaving a huge gap that implementers have to sort out themselves and could easily get wrong.

The DOM-based model used by the HTML 5 parsing algorithm can be easily mapped onto common implementation strategies (DOM, SAX, XOM, ElementTree, etc). The token-based model of HTML 4 can't (hence the crazy incompatibilities between HTML parsers, and the need for a huge amount of work in HTML 5 to define the mapping for the first time). So defining RDFa triple extraction based on HTML 5 seems much more likely to achieve the goal than defining it based on HTML 4, and therefore seems a more useful thing to work on.

<http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>)
In HTML 5, where there is an XML serialisation method, that might make sense. In HTML 4 however, we don't have that luxury.

Sounds like another benefit of defining RDFa-in-HTML based on HTML 5 instead of suffering the restrictions of HTML 4 :-)

[...] in the HTML profile I think it would be reasonable to require that prefix names are mapped to lower-case during processing. Or some other solution that gets us to the point where a browser-based implementation that requests attribute names from a DOM node can still work. My conclusion here is that prefix names should be treated case-insensitively in the HTML profile. Do you agree?

HTML parsers (by which I mean HTML 5 and web browsers) don't preserve the case of element or attribute names. Anything processing the output from a parser will see everything as lowercase (or uppercase, depending on what API they use), so RDFa mustn't consider the case of attribute names to be significant. When comparing prefixes in CURIEs, I suppose it could do a case-insensitive comparison, but that would be unnecessary complexity and annoyingly inconsistent with XHTML. rdfquery and http://www.w3.org/2006/07/SWD/RDFa/impl/js/ appear to treat the attribute name as lowercase, and then case-sensitively compare against the CURIE prefix.
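A minimal sketch of that behaviour (lowercased attribute names, case-sensitive prefix comparison) is below. It is not taken from rdfquery or the W3C JavaScript implementation; PrefixCollector and expand_curie are hypothetical names, and Python's html.parser stands in for "an HTML parser that lowercases attribute names".

```python
from html.parser import HTMLParser


class PrefixCollector(HTMLParser):
    """Collect xmlns:* prefix mappings as an HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names, so "xmlns:DC" arrives
        # here as "xmlns:dc". Attribute values keep their original case.
        for name, value in attrs:
            if name.startswith("xmlns:") and value is not None:
                self.prefixes[name[len("xmlns:"):]] = value


def expand_curie(curie, prefixes):
    """Expand prefix:reference with a case-sensitive prefix lookup."""
    prefix, separator, reference = curie.partition(":")
    if not separator:
        return None
    base = prefixes.get(prefix)
    return None if base is None else base + reference


collector = PrefixCollector()
collector.feed('<div xmlns:DC="http://purl.org/dc/elements/1.1/" property="DC:title">')
collector.close()

lower = expand_curie("dc:title", collector.prefixes)  # resolves
upper = expand_curie("DC:title", collector.prefixes)  # None: prefix case differs
```

Under these rules an author who declares xmlns:DC in text/html has to write dc:title in the CURIE, which differs from how the same markup would behave through an XML parser.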

Should the same processing rules be used for documents from both HTML and XHTML parsers, or would DOM-based implementations need to detect where the input came from and switch processing rules accordingly? If there is a difference, what happens if I adoptNode from an XHTML document into an HTML document, or vice versa?
Err... What's adoptNode?


http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-adoptNode

And how are these two documents getting together?

I might have an HTML document (containing some RDFa), which uses XMLHttpRequest to download an XHTML fragment (also containing some RDFa) and inserts it into the current page, and then I might attempt to extract RDF triples from the page.

I mean, that's sort of out of scope of an HTML 4 profile for RDFa.

It's out of scope for HTML 4, but it seems necessary for the goal that "all implementations extract the same collection of triples from a given document" if you include dynamic implementations. (And http://rdfa.info/wiki/Dynamic-content-parsing suggests people are interested in dynamic implementations.)

With regard to the first part of the question, I believe the same processing rules can be used.

So I could use the "lang" attribute (instead of "xml:lang") in XHTML documents as well as HTML, because the same processing rules would be applied? (If so, it would be nice if the RDFa-in-XHTML specification agreed with that.)


--
Philip Taylor
pj...@cam.ac.uk
