Shane McCarron wrote:
Philip Taylor wrote:
Indeed, it would be good to have this defined with the level of precision
that HTML 5 has, so we can be sure implementations will be able to
agree on how to extract RDFa from text/html content.
[...]
Well - remember that the document you are looking at is written in the
context of HTML 4. In HTML 4 none of what you say above makes any
sense. Attributes are tokens - and the token "xml:lang" is what I was
talking about.
Yeah, I'm not sure what else you could do in the context of HTML 4. I'm
approaching this from the context of HTML 5 - I think it would be
valuable to define precisely the mapping from text/html to RDF triples,
so that people can know what to expect when they run their content
through any RDFa-aware tool, and it only seems to be feasible to define
that in the HTML 5 context.
(This might be in addition to an HTML 4 extension like in your document,
not necessarily a replacement, but I'm not personally interested in
working with HTML 4. Maybe that means the "HTML 4 Profile for RDFa"
thread is not the best place to discuss this, but better here than
nowhere...)
[Stuff about XMLLiterals]
We have no presumption of how an RDFa processor is implemented. It
might be client side via a browser. It might be server side. It might
be part of an XML tool-chain. It doesn't really matter.
Is there any implementation that is *not* based on some kind of abstract
document model (like DOM or SAX or some custom tree structure, where
documents are parsed into elements and attributes before any further
processing)?
It seems to me that requiring the abstract document model to be
re-serialised into well-formed XML (regardless of whether it originated
from an XML parser, or from parsing HTML with missing quotes and
unclosed <br>s, or from a DOM API, or anywhere else) would be the best
way to ensure correctness (since the output will always be well-formed
XML, by definition), functionality (since it would let you use
XMLLiterals in text/html with few surprises or special cases), and
practical implementability (since everyone should already have a tree of
elements and attributes and be able to serialise it into XML). But that
does rely on the concept of a document model, which only really exists in
HTML 5 and not in HTML 4.
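
As a rough illustration of the re-serialisation idea, here is a minimal Python sketch using only the standard library. It is not an HTML 5 parser: `html.parser` is merely tolerant of missing quotes and unclosed tags, and the names `TreeBuilder`, `html_to_xml`, the `VOID` set and the wrapping `root` element are all assumptions made for this example. It also does not validate attribute names, so well-formed output is only guaranteed for names that are already legal in XML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of elements that never have content.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Build an ElementTree from tag soup (a sketch, not the HTML 5 algorithm)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")  # artificial wrapper element
        self.stack = [self.root]        # currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        el = ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(source):
    """Parse lenient HTML into a tree, then serialise the tree as XML."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()  # flush any buffered text and leave open elements to the serialiser
    return ET.tostring(builder.root, encoding="unicode")


# Missing quotes, an unclosed <br>, and unclosed <b> and <p>.
xml_text = html_to_xml("<p class=note>Hello<br>world <b>bold")
ET.fromstring(xml_text)  # raises ParseError if the output were not well-formed
print(xml_text)
```

The point of the sketch is only that the serialiser works from the tree, so it never sees the original markup errors; an XMLLiteral could be produced the same way from a subtree.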
I think you need to take a step back and think about
goals rather than implementation strategies. The goal here is that all
implementations extract the same collection of triples from a given
document.
I like that goal :-). I don't want to limit things to a single
implementation strategy (e.g. DOM) - but some people will use that
implementation strategy, and if other implementations are required to
extract the same collection of triples, then it seems sensible to define
the requirements in a way that can be easily mapped onto that
implementation strategy (and preferably onto others), rather than
leaving a huge gap that implementers have to sort out themselves and
could easily get wrong.
The DOM-based model used by the HTML 5 parsing algorithm can be easily
mapped onto common implementation strategies (DOM, SAX, XOM,
ElementTree, etc). The token-based model of HTML 4 can't (hence the
crazy incompatibilities between HTML parsers, and the need for a huge
amount of work in HTML 5 to define the mapping for the first time). So
defining RDFa triple extraction based on HTML 5 seems much more likely
to achieve the goal than defining it based on HTML 4, and therefore
seems a more useful thing to work on.
<http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>
In HTML 5, where there is an XML serialisation method, that might make
sense. In HTML 4 however, we don't have that luxury.
Sounds like another benefit of defining RDFa-in-HTML based on HTML 5
instead of suffering the restrictions of HTML 4 :-)
[...] in the HTML profile I think it
would be reasonable to require that prefix names are mapped to
lower-case during processing. Or some other solution that gets us to
the point where a browser-based implementation that requests attribute
names from a DOM node can still work. My conclusion here is that prefix
names should be treated case-insensitively in the HTML profile. Do you
agree?
HTML parsers (by which I mean HTML 5 and web browsers) don't preserve
the case of element or attribute names. Anything processing the output
from a parser will see everything as lowercase (or uppercase, depending
on what API they use), so RDFa mustn't consider the case of attribute
names to be significant. When comparing prefixes in CURIEs, I suppose it
could do a case-insensitive comparison, but that would be unnecessary
complexity and annoyingly inconsistent with XHTML. rdfquery and
http://www.w3.org/2006/07/SWD/RDFa/impl/js/ appear to treat the
attribute name as lowercase, and then case-sensitively compare against
the CURIE prefix.
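
To make the behaviour concrete, here is a small Python sketch of "lowercase attribute name, case-sensitive prefix comparison". Python's `html.parser` lowercases attribute names much as browsers do, but leaves attribute values alone. The helper names (`collect_attrs`, `prefix_mappings`, `resolve_curie`) are invented for this example and are not taken from rdfquery or any other implementation; real CURIE processing has more rules than this.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every (name, value) attribute pair the parser reports."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def collect_attrs(source):
    parser = AttrCollector()
    parser.feed(source)
    parser.close()
    return parser.attrs


def prefix_mappings(attrs):
    """Map each xmlns:* attribute (already lowercased by the parser) to its URI."""
    marker = "xmlns:"
    return {name[len(marker):]: value
            for name, value in attrs
            if name.startswith(marker)}


def resolve_curie(curie, mappings):
    """Expand prefix:reference using a case-sensitive prefix lookup."""
    prefix, _, reference = curie.partition(":")
    base = mappings.get(prefix)
    return None if base is None else base + reference


# The author wrote an upper-case prefix in both the declaration and the CURIE.
attrs = collect_attrs(
    '<div xmlns:DC="http://purl.org/dc/elements/1.1/" property="DC:title">x</div>'
)
mappings = prefix_mappings(attrs)
print(mappings)                            # the declared prefix is now "dc"
print(resolve_curie("DC:title", mappings))  # no match under case-sensitive rules
print(resolve_curie("dc:title", mappings))  # matches the lowercased declaration
```

Under these rules, content that declares `xmlns:DC` and then uses `DC:title` yields no triple from a text/html parser, although the same markup would work through an XML parser, which is the inconsistency being discussed.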
Should the same processing rules be used for documents from both HTML
and XHTML parsers, or would DOM-based implementations need to detect
where the input came from and switch processing rules accordingly? If
there is a difference, what happens if I adoptNode from an XHTML
document into an HTML document, or vice versa?
Err... What's adoptNode?
http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-adoptNode
And how are these two documents getting together?
I might have an HTML document (containing some RDFa), which uses
XMLHttpRequest to download an XHTML fragment (also containing some RDFa)
and inserts it into the current page, and then I might attempt to
extract RDF triples from the page.
I mean, that's sort of out of scope of an HTML 4 profile for RDFa.
It's out of scope for HTML 4, but it seems necessary for the goal that
"all implementations extract the same collection of triples from a given
document" if you include dynamic implementations. (And
http://rdfa.info/wiki/Dynamic-content-parsing suggests people are
interested in dynamic implementations.)
With regard to the first part of the question, I believe the same
processing rules can be used.
So I could use the "lang" attribute (instead of "xml:lang") in XHTML
documents as well as HTML, because the same processing rules would be
applied? (If so, it would be nice if the RDFa-in-XHTML specification
agreed with that.)
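
For what it's worth, a single rule covering both cases could be as small as the Python sketch below. This is one possible reading, not something either specification is confirmed to say in this thread: the function name `element_language`, the flat dictionary of attribute names, and the choice to let "xml:lang" win over "lang" when both are present are all assumptions for illustration.

```python
def element_language(attrs, inherited=None):
    """Return the language for an element.

    attrs is a dict of attribute names (as reported by the parser) to values.
    Assumption: "xml:lang" takes precedence over "lang"; if neither is
    present, the language inherited from the parent element is used.
    """
    for name in ("xml:lang", "lang"):
        if name in attrs:
            return attrs[name]
    return inherited


print(element_language({"lang": "en"}))
print(element_language({"xml:lang": "fr", "lang": "en"}))
print(element_language({}, inherited="de"))
```

A DOM built by an XML parser would expose xml:lang as a namespaced attribute rather than a plain "xml:lang" token, so a real implementation would need to normalise that before applying a shared rule like this one.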
--
Philip Taylor
pj...@cam.ac.uk