Philip Taylor wrote:
Indeed, it would be good to have this defined with the level of precision
that HTML 5 has, so we can be sure implementations will be able to
agree on how to extract RDFa from text/html content.
A few significant issues that I see in the current version:
What is "the @xml:lang attribute"? Is it the attribute with local name
"xml:lang" in no namespace (as would be produced by an HTML 5 parser
(and by current HTML browser parser implementations))? or the
attribute with local name "lang" in the namespace
"http://www.w3.org/XML/1998/namespace" (as would be produced by an XML
parser, and could be inserted in an HTML document via DOM APIs)? or
both (in which case both could be specified on one element, in
addition to "lang" in no namespace)?
Well - remember that the document you are looking at is written in the
context of HTML 4. In HTML 4 none of what you say above makes any
sense. Attributes are tokens - and the token "xml:lang" is what I was
talking about. In HTML 4 those attribute names are case-insensitive -
I need to add something about that to the draft. Thanks for the reminder!
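To make the contrast concrete, here is a small illustrative sketch (Python, purely for discussion, not anything normative): an HTML-style parser hands back "xml:lang" as a flat, case-normalized token, while an XML parser resolves the "xml" prefix to its fixed namespace.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# An HTML parser reports the attribute as a flat, lowercased token
# with no namespace attached to it.
class AttrCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

p = AttrCollector()
p.feed('<p XML:LANG="en">Hello</p>')
print(p.attrs)  # [('xml:lang', 'en')]

# An XML parser resolves the reserved "xml" prefix, so the attribute
# becomes local name "lang" in the XML namespace.
el = ET.fromstring('<p xml:lang="en">Hello</p>')
print(el.attrib)  # {'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
```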
"If the object of a triple would be an XMLLiteral, and the input to
the processor is not well-formed [XML]" - I don't understand what that
means in an HTML context. Is it meant to mean something like "the
bytes in the HTML file that correspond to the contents of the relevant
element could be parsed as well-formed XML (modulo various namespace
declaration issues)"? If so, that seems impossible to implement. The
input to the RDFa processor will most likely be a DOM, possibly
manipulated by the DOM APIs rather than coming straight from an HTML
parser, so it may never have had a byte representation at all.
We have no presumption of how an RDFa processor is implemented. It
might be client side via a browser. It might be server side. It might
be part of an XML tool-chain. It doesn't really matter. In this case,
the document I wrote is a little too fuzzy because the idea is not
completely cooked yet. Here's the problem: RDFa permits the creation
of objects that are of type XMLLiteral. That datatype is tightly
defined, and as you can imagine it is expected to contain well-formed
XML. If a Conforming RDFa Processor were to generate triples that
contained data of type XMLLiteral, and that data were not "well-formed"
as defined in XML, then consumers of that data could easily be very
surprised!
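One way an implementation might guard against that (a sketch of the idea only, not something the draft mandates) is to attempt to parse the candidate literal as an XML fragment before emitting the triple, and drop or repair it on failure:

```python
import xml.etree.ElementTree as ET

def is_well_formed_fragment(s):
    """Return True if s parses as a well-formed XML fragment.
    Wrapping in a dummy root allows multiple top-level nodes.
    (Namespace-prefix declaration issues are ignored here.)"""
    try:
        ET.fromstring('<wrap>' + s + '</wrap>')
        return True
    except ET.ParseError:
        return False

print(is_well_formed_fragment('Some <em>marked up</em> text'))  # True
print(is_well_formed_fragment('An unclosed <br> tag'))          # False
```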
Even without scripting, there isn't always a contiguous sequence of
bytes corresponding to the content of an element. E.g. if the HTML
input is:
<table>
<tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
<td> This text goes inside the table </td>
This text gets parsed to *outside* the table
<td> This text goes inside the table </td>
</tr>
</table>
then (according to the HTML 5 parsing algorithm, and implemented in
(at least) Firefox) the content of the <tr> element includes the first
and third lines of text, but not the second. How would you decide
whether the content is well-formed XML?
Yeah - tricky. I think you need to take a step back and think about
goals rather than implementation strategies. The goal here is that all
implementations extract the same collection of triples from a given
document. There are a lot of ways to achieve that. In the XHTML
profile of RDFa we relied upon the XML parsing model. Consequently, we
are confident that we are being handed well-formed content. If you do
an implementation via the DOM, you can also be confident of that, since
by the time content gets into the DOM you can assume the processor has
done whatever magic was necessary and you have a node tree that you
could turn back into content that would be well-formed. If you are
writing your own parser that sifts through a document character by
character... well, you are going to have some work ahead of you!
With regard to your example above.... if I had a DOM based processor, I
would have missed out on line 3 I imagine. If I wrote my own I would
have included it ('cause that is well-formed - the XML parser would have
handed it to me). In the XHTML profile we (sort of) address this in
that we only tightly constrain behavior for *valid* content. The
content above is *invalid* according to the XHTML+RDFa schema - so while
the behavior of existing implementations might be inconsistent, I
personally won't get too excited about it.
In the HTML profile of RDFa, things are much the same. We can attempt
to be very very precise about how the parsing of the content should be
handled, or we can rely upon the parsing model spelled out by the
underlying specification (HTML 4 in this case). Now, I am sure you will
agree that HTML 4 does a pretty poor job of defining the parsing model,
but.... is it adequate for our needs in this instance? My belief is
that it is adequate - at least for the vast majority of the RDFa
processing rules. In particular, most implementors will rely upon
existing parsing libraries, and the problems associated with that
parsing have largely been sorted out over the years - even to the point
that they are being codified in the early draft HTML5 documents.
The only place I have a concern is with regard to creating XMLLiterals.
This is a very powerful aspect of RDFa, and I am loath to disable it in
the HTML profile if I don't have to. Instead, I would like to identify
a light-weight model that implementors can use. For example, we could
say that if an object is of type XMLLiteral, then its content is escaped
so that there is no markup (< becomes &lt;, etc.). This would mean that it is
"well formed XML", and that it could be turned back into its original
source form, which is the goal of such content. However, it would also
mean that a consumer of such content would need to know this and do the
reverse transformation before using the content. I don't know what the
right answer is - maybe we can figure it out together?
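A rough sketch of that escaping idea (the exact escaping rule is undecided; this just uses the standard XML character escapes to illustrate the round trip):

```python
from xml.sax.saxutils import escape, unescape

source = 'Fragment with a <b>bold</b> bit & an entity'

# Producer side: escape the markup so the literal is trivially
# well-formed XML (no elements, just character data).
literal = escape(source)
print(literal)  # Fragment with a &lt;b&gt;bold&lt;/b&gt; bit &amp; an entity

# Consumer side: reverse the transformation to recover the original
# source form before using the content.
print(unescape(literal) == source)  # True
```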
For this to make sense in real HTML implementations, the definition
should be in terms of the document layer rather than the byte layer.
(The XMLLiteral should be an XML-fragment serialisation of the
element, and some error handling (like ignoring the triple) would
occur if it's impossible to serialise as XML, similar to the
requirements in
<http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>)
In HTML 5, where there is an XML serialisation method, that might make
sense. In HTML 4 however, we don't have that luxury. I suppose we
could say that the HTML 4 content is transformed into corresponding
XHTML 1.0 content... but there are really no reliable serializers out
there that do that.
How are xmlns:* attributes meant to be processed? E.g. what is the
expected output in the following cases:
<div xmlns:T="test:">
<span typeof="t:x" property="t:y">Test</span>
</div>
<div XMLNS:t="test:">
<span typeof="t:x" property="t:y">Test</span>
</div>
<div xmlns:T="test:">
<span typeof="T:x" property="T:y">Test</span>
</div>
<div xmlns:t="test:">
<div xmlns:t="">
<span typeof="t:x" property="t:y">Test</span>
</div>
</div>
<div xmlns:t="test1:" id="d">
<span typeof="t:x" property="t:y">Test</span>
</div>
<script>
document.getElementById('d').setAttributeNS(
'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
/* (now the element has two distinct attributes,
each in different namespaces) */
</script>
I had not thought about this much before. Attribute names in HTML /
SGML are case-insensitive. CURIE prefix names are of course NOT.
However, I can almost guarantee you that browser-based implementations
of the XHTML profile right now would fail to work correctly when faced
with CURIE prefixes that differ only in case. Interesting point - I am
going to test that later.
I think we would be wise to advise document authors to not define
prefixes that differ only in case. And in the HTML profile I think it
would be reasonable to require that prefix names are mapped to
lower-case during processing. Or some other solution that gets us to
the point where a browser-based implementation that requests attribute
names from a DOM node can still work. My conclusion here is that prefix
names should be treated case-insensitively in the HTML profile. Do you
agree?
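A sketch of that lower-casing rule (illustrative only - the function names and the exact behavior are my assumptions, not settled spec text): fold both the xmlns token and the declared prefix to lower case when building the mapping, and fold the CURIE prefix the same way when resolving.

```python
def collect_prefix_mappings(attrs):
    """Build a CURIE prefix map from (name, value) attribute pairs,
    treating the xmlns token and the prefix case-insensitively,
    as proposed for the HTML profile."""
    mappings = {}
    for name, value in attrs:
        lowered = name.lower()
        if lowered.startswith('xmlns:'):
            mappings[lowered[len('xmlns:'):]] = value
    return mappings

def resolve_curie(curie, mappings):
    """Expand prefix:reference against the map, ignoring prefix case.
    Returns None for an undeclared prefix."""
    prefix, _, reference = curie.partition(':')
    base = mappings.get(prefix.lower())
    return base + reference if base is not None else None

# The first two of Philip's cases above then yield the same result:
m = collect_prefix_mappings([('XMLNS:T', 'test:')])
print(resolve_curie('t:x', m))  # test:x
print(resolve_curie('T:y', m))  # test:y
```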
Should the same processing rules be used for documents from both HTML
and XHTML parsers, or would DOM-based implementations need to detect
where the input came from and switch processing rules accordingly? If
there is a difference, what happens if I adoptNode from an XHTML
document into an HTML document, or vice versa?
Err... What's adoptNode? And how are these two documents getting
together? I mean, that's sort of out of scope of an HTML 4 profile for
RDFa. With regard to the first part of the question, I believe the same
processing rules can be used. I have an implementation that does it
now. So do lots of other people. My implementation is DOM based
though, so that makes it relatively simple to have the same rules work.
Thanks for your comments!
--
Shane P. McCarron Phone: +1 763 786-8160 x120
Managing Director Fax: +1 763 786-8180
ApTest Minnesota Inet: sh...@aptest.com