Danny Ayers wrote:
> On 28/02/07, Stefano Mazzocchi <[EMAIL PROTECTED]> wrote:
>
>> Guys, please, let's color the bikeshed another time.
>
> Point taken. For what it's worth, I don't actually disagree with Ben;
> RDFa does seem a useful way of publishing RDF.
>
> There are a couple of tangible points nearby, though, that I ought to
> mention.
>
>> There is really no point in arguing which approach is better to RDFize
>> information: if it works for you, great, if not use something else that
>> does. And if nothing does, create your own.
>
> Agreed, with one reservation. I think it's useful to make the
> distinction between scraping and parsing. I just posted to the
> microformats list on this point [1]. In that context, and whatever the
> extraction mechanism used, if an HTML document includes a profile URI
> (as described in the HTML spec) the extracted data is known to follow
> the publisher's intent; without a profile URI we can only assume that
> the data is what the publisher meant.
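Concretely, the check Danny describes amounts to something like the
sketch below (a sketch only: the profile URI is a made-up example value,
and a real consumer would match whichever profile URIs it knows about):

  // Does this page declare a profile URI on <head>, per the HTML spec?
  // If it does, extracted data is known to follow the publisher's
  // intent; if not, a scraper can only guess at it.
  var head = document.getElementsByTagName("head")[0];
  var profiles = (head.getAttribute("profile") || "").split(/\s+/);
  var declared = false;
  for (var i = 0; i < profiles.length; i++) {
    // illustrative profile URI; substitute the one your vocabulary uses
    if (profiles[i] == "http://example.org/profiles/hcard") {
      declared = true;
    }
  }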
In theory, sure. In practice, writing an HTML parser should just be a
matter of reading the spec and implementing it, right? My point is that
out in the wild, believing that people will adhere to anything specified
just because it would make *your* life as a consumer easier is utterly
naive. Microformats (or RDF ontologies, for that matter) are cursed to
exhibit the 'babel syndrome': usage and feedback are what stabilize a
producer/consumer cycle, not standardization. That leads to scale-free
distributions (aka the long tail).

Personally, I think that parsing has a closed-world taste to it, while
scraping has an open-world twist: the first camp thinks that validation
practices can be put forth to make data production more coherent, the
second will just cope with whatever happens (and would benefit from any
success of the first group anyhow). Call me politically disillusioned,
or call me a realist, but I believe that those who think coherence is a
necessary condition for data interoperability are setting themselves up
for a big disappointment.

>> Personally, I dislike GRDDL as long as it keeps ties to XSLT, as XSLT is
>> a *horrible* way to write RDFizers compared to, say, javascript. [and
>> it's not for lack of XSLT knowledge that I say so]
>
> I don't disagree about XSLT being hard work, but GRDDL isn't formally
> tied to it.

Again, theory vs. practice. I applaud efforts to come up with solutions
that reduce the gap between data publishing and RDF publishing, but
without taking into consideration the practical implications, today's
technological limitations and boundaries, or (worse) the socio-economic
aspects associated with them, it all feels ivory-towerish.

> XSLT is only one way of expressing the transformation
> algorithm.

The only workable one in practice today.

> One reason the docs are full of it is convenience - it's
> an easy fit for the document -process-> RDF kind of pipeline, just
> give the "process" stylesheet a URI and you're done (another reason
> was the existing rdf-in-html material using XSLT).

I like convenience: it speaks of practice. And I like incremental
design: it speaks of evolution. What I don't like is overdesign, or
obscure abstractions used to overcome practical limitations in theory
without showing an incremental way to enable them in practice. I've
been part of several expert groups, and I know that design-by-committee
is to blame for such results, not any lack of collective intelligence
in the group: it's like evolution without adaptation feedback.

> As the spec puts it:
> [[
> While technically Javascript, C, or virtually any other programming
> language may be used to express transformations for GRDDL, XSLT is
> specifically designed to express XML to XML transformations and has
> some good safety characteristics.
> ]]
> Also note there's nothing to stop an implementation seeing a
> transformation URI like "http://example.org/wiki2rdf.xsl" and using
> Javascript to do an equivalent transformation.

Nothing? How about the fact that if I express GRDDL in XSLT I already
have an implicit "output" channel, while if I do it in javascript I
don't? Should I embed a C compiler in my GRDDL-enabled crawler so that
I can recompile the code to run on my platform? That line you quote
above is *exactly* the kind of thing that makes me kick and scream
about some W3C recommendations, for the patronizing "this is left as an
exercise to the reader" taste it leaves.
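To make that asymmetry concrete, here is a minimal sketch of a
GRDDL-style transformation in javascript. The function name and the
triple representation are made up, precisely because nothing specifies
them:

  // Extracting triples from a DOM is the easy part...
  function transform(doc) {
    var triples = [];
    var anchors = doc.getElementsByTagName("a");
    for (var i = 0; i < anchors.length; i++) {
      triples.push({
        subject:   doc.URL,
        predicate: "http://www.w3.org/2000/01/rdf-schema#seeAlso",
        object:    anchors[i].href
      });
    }
    // ...but who receives this return value? An XSLT processor knows
    // where its result tree goes; here the output contract simply
    // doesn't exist.
    return triples;
  }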
There is a difference between theory and practice, and it's a
win-or-lose one: in theory, GRDDL could be described with a deck of
punch cards readable by an IBM mainframe from the 60's, but that's
hardly useful if I have no way to:

 1) get to the GRDDL description;
 2) obtain an executable representation of it;
 3) execute it; and, most importantly,
 4) get the resulting data *out* of the program!

I'll stop considering GRDDL as just another way to apply XSLT to a web
page when the above four points are explicitly addressed, not before.

>> If there was a standardized object model for RDF stores in javascript
>> (sort of a DOM for RDF), then you could imagine having cross-platform
>> GRDDL in javascript (and yes, I'm aware that W3C wants to standardize
>> that), but for now you're stuck with XSLT.
>
> I've not spent enough hands-on time with Javascript to know the
> issues, but a standard js model does seem a very good idea. There
> would be the bits and pieces around JSON to draw on, and Tabulator's
> internal model, and I bet the SIMILE work has already covered most
> angles.

I rather strongly disagree: there are a few groups trying things out
(including us), but we have no way of knowing yet what works and what
doesn't (and therefore what's needed and what's not). Creating a
working group before there is any evidence of where the problems are is
a perfect way to come out the other end with something so far from
useful that it hurts. Case in point: XML Schema vs. RELAX NG or, closer
to the topic at hand, XQuery vs. SPARQL.

> Whatever, it would be really good to get some examples of
> Javascript-based RDFizers used with GRDDL, if you have any thoughts on
> how best to do this please drop a line to the mailing list.

You can't use javascript for GRDDL, because there is no way to get the
data out (in a portable way, that is)! In Piggy Bank (and Solvent, and
in the near future Crowbar), we offer a 'data' object that the scraper
pushes the created statements onto; that is how the resulting data gets
collected (a rough sketch of what that looks like is below, after the
next quoted exchange). Unlike XSLT, javascript has no notion of
"STDOUT". You can use document.write() to append stuff to the document,
you can decide on a particular fixed element in the <head>, say
<head><data>, and put your generated RDF/XML in there, anything! But
there must be agreement on where to put it, or it will only work with
crawlers that expect that convention. I cannot provide a useful GRDDL
RDFizer in javascript before there is even the slightest agreement on
how to get the generated data out. Also, having the spec acknowledge
that "theory is not practice", and that every language might require
special agreements due to its own nature, would go a long way toward
easing my dissatisfaction and allowing me to cooperate.

>> So, Ben says RDFa is better because it's more explicit, you say GRDDL
>> is better because it allows you to RDFize even stuff that is not RDF to
>> start with (like microformats), and I say that scraping is better than
>> GRDDL because I can use a real programming language and because I don't
>> need to have any RDF buy-in from the data publisher.
>
> Heh, quite. Although I rather like the argument that with GRDDL it
> means the domain-specific stuff *is* RDF to start with, without
> publisher buy-in. CustomRdfDialects [2] as Dan Connolly puts it.

If you have to put even a single line of content inside a published
page on somebody's web site then, by definition, they have to buy into
it. "Buy in" is not the same as "pay for it".
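Here is the sketch promised above: roughly what a scraper looks like
under the Piggy Bank / Solvent convention. Treat the method name and
argument order as illustrative rather than as the exact API:

  // The host injects a 'data' object and the scraper pushes statements
  // onto it. That object *is* the agreed-upon output channel that
  // plain javascript lacks.
  function scrapePage(doc, data) {
    var titles = doc.getElementsByTagName("title");
    var title = titles.length > 0 ? titles[0].text : doc.URL;
    data.addStatement(
      doc.URL,                                  // subject
      "http://purl.org/dc/elements/1.1/title",  // predicate
      title,                                    // object
      true                                      // object is a literal
    );
  }

Anyway, back to the buy-in point.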
Publishers can very well use GRDDL transformers written and maintained
by third parties, but GRDDL will only work if they put a link to that
transformation in their pages. The act of adding that single line to
the page templates is very little work in most cases. The act of
convincing the data owners to spend the time to understand the
implications of that action is *far* from it.

>> But I continue to think that having RDFa data embedded right in the page
>> could be useful.
>
> Assuming the implementation cost isn't excessive, that seems a very
> good idea.
>
> Cheers,
> Danny.
>
> [1] http://microformats.org/discuss/mail/microformats-discuss/2007-February/008880.html
> [2] http://esw.w3.org/topic/CustomRdfDialects

Sorry if I sounded harsh; it has nothing to do with you, Danny, but you
accidentally stepped on one of my nerves ;-)

-- 
Stefano Mazzocchi
Research Scientist, Digital Libraries Research Group
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
skype: stefanomazzocchi
email: stefanom at mit . edu
-------------------------------------------------------------------

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
