Danny Ayers wrote:
> On 28/02/07, Stefano Mazzocchi <[EMAIL PROTECTED]> wrote:
> 
>> Guys, please, let's color the bikeshed another time.
> 
> Point taken. For what it's worth, I don't actually disagree with Ben,
> RDFa does seem a useful way of publishing RDF.
> 
> There are a couple of tangible points nearby though that I ought to mention.
> 
>> There is really no point in arguing which approach is better to RDFize
>> information: if it works for you, great, if not use something else that
>> does. And if nothing does, create your own.
> 
> Agreed, with one reservation. I think it's useful to make the
> distinction between scraping and parsing. I just posted to the
> microformats list on this point [1]. In that context and whatever the
> extraction mechanism used, if a HTML document includes a profile URI
> (as described in the HTML spec) the extracted data is known to follow
> the publisher's intent; without a profile URI we have to make the
> assumption the data is what the publisher meant.

In theory, sure. In practice, writing an HTML parser should just be a
matter of reading the spec and implementing it, right?

My point is that out in the wild, believing that people will adhere to
anything specified just because it would make *your* consumer life
easier is utterly naive.
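For what it's worth, the profile check Danny describes could be sketched
roughly like this (a hypothetical consumer-side check; the regex, function
names, and the hCard profile URI used as "known" are all illustrative, not
anything a spec mandates):

```javascript
// Hypothetical sketch: a consumer that only treats extracted data as
// matching publisher intent when the page declares a profile URI it knows.
// The profile list below is an illustrative assumption.
const KNOWN_PROFILES = ["http://microformats.org/profile/hcard"];

function declaredProfiles(html) {
  // Naive extraction of the <head profile="..."> attribute; a real
  // consumer would use an actual HTML parser, not a regex.
  const m = html.match(/<head[^>]*\bprofile\s*=\s*"([^"]*)"/i);
  if (!m) return null;
  return m[1].split(/\s+/).filter(Boolean); // profile may list several URIs
}

function publisherIntentKnown(html) {
  const profiles = declaredProfiles(html);
  return profiles !== null && profiles.some(p => KNOWN_PROFILES.includes(p));
}
```

Of course, the whole point above is that in the wild most publishers will
never bother to declare the profile, so a consumer relying on this check
alone goes hungry.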

Microformats (or RDF ontologies, for that matter) are cursed to exhibit
the 'babel syndrome': usage and feedback are what stabilize a
producer/consumer cycle, not standardization. That leads to scale-free
distributions (aka the long tail).

Personally, I think that parsing has a closed-world taste to it, while
scraping has an open-world twist: the first camp thinks that validation
practices can be put forth to make the data production more coherent,
the second will just cope with whatever happens (and would benefit from
any success of the first group anyhow).

Call me politically disillusioned or call me realist but I believe that
those who think that coherence is a necessary condition for data
interoperability are setting themselves up for a big disappointment.

>> Personally, I dislike GRDDL as long as it keeps ties to XSLT, as XSLT is
>> a *horrible* way to write RDFizers compared to, say, javascript. [and
>> it's not for lack of XSLT knowledge that I say so]
> 
> I don't disagree about XSLT being hard work, but GRDDL isn't formally
> tied to it. 

Again, theory vs. practice. I applaud efforts to come up with solutions
that reduce the gap between data publishing and RDF publishing, but
without taking into consideration the practical implications, today's
technological limitations and boundaries, or (worse) the
socio-economic aspects associated with them, it feels ivory-towerish.

> XSLT is only one way of expressing the transformation
> algorithm. 

The only workable one in practice today.

> One reason the docs are full of it is convenience -  it's
> an easy fit for the document -process-> RDF kind of pipeline, just
> give the "process" stylesheet a URI and you're done (another reason
> was the existing rdf-in-html material using XSLT).

I like convenience, it speaks of practice. And I like incremental
design, it speaks of evolution.

What I don't like is overdesign or obscure abstractions as a way to
overcome, in theory, practical limitations, but without showing an
incremental way to enable them.

I've been part of several expert groups and I know that
design-by-committee is to blame for such results and not the lack of
collective intelligence of the group: it's like evolution without
adaptation feedback.

> As the spec puts it:
> [[
> While technically Javascript, C, or virtually any other programming
> language may be used to express transformations for GRDDL, XSLT is
> specifically designed to express XML to XML transformations and has
> some good safety characteristics.
> ]]
> Also note there's nothing to stop an implementation seeing a
> transformation URI like "http://example.org/wiki2rdf.xsl"; and using
> Javascript to do an equivalent transformation.

Nothing? How about the fact that if I express GRDDL in XSLT I already
have an implicit "output" channel, while if I do it in Javascript I
don't? Should I embed a C compiler in my GRDDL-enabled crawler so that I
can recompile the code so that it works on my platform?

The line you quote above is *exactly* the kind of thing that makes
me kick and scream about some W3C recommendations, for the patronizing
"this is left as an exercise to the reader" taste it leaves. There is a
difference between theory and practice, a win-or-lose one: in theory,
GRDDL could be described on a deck of punch cards readable by an IBM
mainframe in the '60s, but that's hardly useful if I have no way to:

1) get to the GRDDL description
2) obtain an executable representation of it
3) execute it

and most importantly

4) get the resulting data *out* of the program!

I'll stop considering GRDDL as just another way to apply XSLT to a web
page when the above four points are explicitly addressed, not before.
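To make the gap concrete, here is a sketch of what a GRDDL-aware crawler
would have to do at steps (2)-(4). Everything here, the dispatch table, the
function names, is an assumption of mine, not anything the GRDDL spec
defines, which is exactly the problem:

```javascript
// Hypothetical sketch of a crawler dispatching on the transformation's
// language. The table and names are assumptions for illustration only.
const EXECUTORS = {
  // For XSLT, (2) and (3) are solvable because processors are standardized,
  // and (4) is implicit: the transformation *result* is the RDF output.
  "application/xslt+xml": src => applyXsltAndCollectOutput(src),
  // For Javascript there is no agreed equivalent of step (4): where does
  // the transform put its triples? Until that is specified, a crawler
  // cannot execute an arbitrary js transformation portably.
  "application/javascript": src => {
    throw new Error("no agreed output channel for Javascript transforms");
  },
};

function runTransformation(mediaType, source) {
  const exec = EXECUTORS[mediaType];   // (2) obtain an executable form
  if (!exec) throw new Error("unsupported transformation language");
  return exec(source);                 // (3) execute, (4) collect the output
}
```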

>> If there was a standardized object model for RDF stores in javascript
>> (sort of a DOM for RDF), then you could imagine having cross-platform
>> GRDDL in javascript (and yes, I'm aware that W3C wants to standardize
>> that), but for now you're stuck with XSLT.
> 
> I've not spent enough hands-on time with Javascript to know the
> issues, but a standard js model does seem a very good idea. There
> would be the bits and pieces around JSON to draw on, and Tabulator's
> internal model, and I bet the SIMILE work has already covered most
> angles.

I rather strongly disagree: there are a few groups trying things out
(including us), but we have no way of knowing what works and what
doesn't (and therefore what's needed and what's not). Creating a working
group before there is any evidence of where the problems are is a
perfect way to come out on the other end with something so far from
useful that it hurts. Case in point: XML Schema vs. RELAX NG or, more
to the topic at hand, XQuery vs. SPARQL.

> Whatever, it would be really good to get some examples of
> Javascript-based RDFizers used with GRDDL, if you have any thoughts on
> how best to do this please drop a line to the mailing list.

You can't use Javascript for GRDDL, as there is no way to get the data
out (in a portable way, that is)!

In Piggy Bank (and Solvent, and in the near future Crowbar), we offer a
'data' object onto which the scraper pushes the created statements.
This is how the resulting data is collected.
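Roughly, the convention looks like this (a sketch of the idea, not the
actual Solvent/Piggy Bank API, whose method names may differ):

```javascript
// Sketch of the 'data' object convention: the host hands the scraper a
// collector, and the scraper pushes statements onto it instead of writing
// to some nonexistent STDOUT. The API shape here is an assumption.
function makeDataCollector() {
  const statements = [];
  return {
    addStatement: (subject, predicate, object) => {
      statements.push({ subject, predicate, object });
    },
    statements,
  };
}

// A scraper never "returns" its output; it pushes onto the collector the
// host supplied. Only hosts implementing this contract can run it, which
// is precisely why the convention needs agreement to be portable.
function exampleScraper(pageUrl, data) {
  data.addStatement(pageUrl,
    "http://purl.org/dc/elements/1.1/title", "Example");
}
```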

Unlike XSLT, Javascript has no notion of "STDOUT". You can use
document.write() and append stuff to the document, you can decide to use
a particular fixed element in the <head>, say <head><data>, and put your
generated RDF/XML in there, anything! But there must be agreement on
where to put it, or it will only work with crawlers that expect that.

I cannot provide a useful GRDDL RDFizer in javascript before there is
even the slightest agreement on how to get the generated data out of there.

Also, having the spec acknowledge that "theory is not practice" and that
every language might require special agreements due to its own nature
would go a long way toward easing my dissatisfaction and allowing me to
cooperate.

>> So, Ben says RDFa is better because more explicit, you say GRDDL is
>> better because allows you to RDFize even stuff that is not RDF to start
>> with (like microformats) and I say that scraping is better than GRDDL
>> because I can use a real programming language and because I don't need
>> to have any RDF buy-in from the data publisher.
> 
> Heh, quite. Although I rather like the argument that with GRDDL it
> means the domain-specific stuff *is* RDF to start with, without
> publisher buy-in. CustomRdfDialects [2] as Dan Connolly puts it.

If you have to put even a single line of content inside a published page
on somebody's web site, then by definition they have to buy into it.

"Buy in" is not the same as "pay for it". They can very well use GRDDL
transformers written and maintained by third parties, but GRDDL will
only work if they put a link to that transformation in their pages.

The act of adding that single line to the page templates is very little
work in most cases. The act of convincing the data owners to spend time
understanding the implications of their action is *far* from it.

>> But I continue to think that having RDFa data embedded right in the page
>> could be useful.
> 
> Assuming the implementation cost isn't excessive, that seems a very good idea.
> 
> Cheers,
> Danny.
> 
> [1] 
> http://microformats.org/discuss/mail/microformats-discuss/2007-February/008880.html
> [2] http://esw.w3.org/topic/CustomRdfDialects

Sorry if I sounded harsh, it has nothing to do with you Danny, but you
accidentally stepped on one of my nerves ;-)

-- 
Stefano Mazzocchi
Digital Libraries Research Group                 Research Scientist
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave               skype: stefanomazzocchi
Cambridge, MA  02139-4307, USA         email: stefanom at mit . edu
-------------------------------------------------------------------

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
