On Thu, Sep 17, 2009 at 7:23 AM, Kingsley Idehen <kide...@openlinksw.com> wrote:
> This is basically an aspect of the whole Linked Data meme that is lost on
> too many.

I've got to thank the book by Allemang and Hendler

http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564

for setting me straight about data modeling in RDF.

RDFS and OWL are based on a system of duck typing that turns conventional object or object-relational thinking inside out. It's not necessarily good or bad, but it's really different. Even though types matter, predicates come before types, because using predicate A can make thing B a member of class C (that's what rdfs:domain and rdfs:range do), even if B is never explicitly put in class C.

Looking at the predicates in RDFS or OWL and not understanding the whole, it's pretty easy to be like "oh, this isn't too different from a relational database" and miss the point that RDFS and OWL are much more about inference (creating new triples) than they are about constraints or the physical layout of the data.

One consequence of this is that using an existing predicate can drag in a lot more baggage than you might want; it's pretty easy to get the inference engine to infer too much, and false inferences can snowball like a katamari.

A lot of people are in the habit of reusing vocabularies and seem to forget that the natural answer to most RDF modeling problems is to create a new predicate. OWL, together with RDFS, has a rich set of mechanisms (rdfs:subPropertyOf being the simplest) that can tell systems that

x A y -> x B y

where A is your new predicate and B is a well-known predicate. Once you merge two "almost-but-not-the-same" things by actually using the same predicate, it's very hard to fix the damage. If you connect them by inference instead, it's easy to change your mind: you just drop the mapping.

--------------

It may be different with other data sets, but data cleaning is absolutely essential when working with dbpedia if you want to make production-quality systems.

For instance, people build bizapps all the time that need a list of US states... Usually we go and cut and paste one from somewhere...
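A rough sketch of the inference point above, in case it helps. This is a toy forward-chainer over plain Python tuples covering only three RDFS rules (domain, range, subPropertyOf); it is not how any real reasoner is implemented, and every ex: name in it is made up for illustration.

```python
# Toy forward-chaining over three RDFS rules. All "ex:" names are hypothetical.
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBPROP = "rdfs:subPropertyOf"


def infer(triples):
    """Return the input triples plus everything the three rules derive."""
    closed = set(triples)
    while True:
        new = set()
        for s, p, o in closed:
            # Look for schema statements whose subject is the predicate p.
            for s2, p2, o2 in closed:
                if s2 != p:
                    continue
                if p2 == DOMAIN:      # p has domain o2, so s is an o2
                    new.add((s, TYPE, o2))
                elif p2 == RANGE:     # p has range o2, so o is an o2
                    new.add((o, TYPE, o2))
                elif p2 == SUBPROP:   # p is a subproperty of o2, so s o2 o
                    new.add((s, o2, o))
        if new <= closed:             # nothing new: fixpoint reached
            return closed
        closed |= new


# A "well-known" predicate that carries baggage with it.
schema = {
    ("ex:capital", DOMAIN, "ex:Country"),
    ("ex:capital", RANGE, "ex:City"),
}

# Reusing the predicate directly: Texas quietly becomes a country.
reused = infer(schema | {("ex:Texas", "ex:capital", "ex:Austin")})
print(("ex:Texas", TYPE, "ex:Country") in reused)     # True

# Using a new predicate instead, with the mapping kept as a separate triple.
data = {("ex:Texas", "ex:stateCapital", "ex:Austin")}
mapping = {("ex:stateCapital", SUBPROP, "ex:capital")}

unmapped = infer(schema | data)
mapped = infer(schema | data | mapping)
print(("ex:Texas", TYPE, "ex:Country") in unmapped)   # False
print(("ex:Texas", "ex:capital", "ex:Austin") in mapped)  # True
```

With the mapping in place you get x A y -> x B y and all the baggage that comes with B; take the single mapping triple out and the raw data is untouched. If the data itself had been written with ex:capital, there would be nothing to take out.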
But now I've got dbpedia and I should be able to do this systematically. There's a category in wikipedia for that...

http://en.wikipedia.org/wiki/Category:States_of_the_United_States

If you ignore the subcategories and just take the actual pages, it's (almost) what you need, except for some weirdos like

User:Beebarose/Alabama <http://en.wikipedia.org/wiki/User:Beebarose/Alabama>

and one state that's got a disambiguator in the name:

Georgia (U.S. state) <http://en.wikipedia.org/wiki/Georgia_%28U.S._state%29>

It's not hard to clean up this list, but it takes some effort, and ultimately you're probably going to materialize something new.

These sorts of issues turn up even in highly clean data sets. I once built a webapp that used a list of countries to draw a dropdown list, but the dropdown was excessively wide and busted the layout of the site. The list was that wide because a few authoritarian countries have long and flowery names. The transformation

*Democratic People's Republic of Korea -> North Korea*

improved the usability of the site while eliminating Orwellian language. This kind of "fit and finish" is needed to make quality sites, and semweb systems are going to need automated and manual ways of fixing this so that "Web 3.0" looks like a step forward, not a step back.
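For what it's worth, the automated half of that cleanup can be sketched in a few lines of Python. This assumes you already have the page titles from the category as plain strings; the sample list is typed in by hand rather than fetched from dbpedia, and the names clean_titles, NON_ARTICLE_PREFIXES and DISPLAY_OVERRIDES are invented for this example.

```python
import re

# Namespace prefixes to reject. Only "User:" comes from the example above;
# the others are assumed to be unwanted as well.
NON_ARTICLE_PREFIXES = ("User:", "Talk:", "Category:", "Template:")

# Hand-maintained display names: the manual half of "fit and finish".
DISPLAY_OVERRIDES = {
    "Democratic People's Republic of Korea": "North Korea",
}

# A trailing parenthetical such as " (U.S. state)".
_DISAMBIGUATOR = re.compile(r"\s*\([^()]*\)$")


def clean_titles(titles):
    """Drop non-article pages, strip disambiguators, apply overrides."""
    cleaned = set()
    for title in titles:
        if title.startswith(NON_ARTICLE_PREFIXES):
            continue  # e.g. a user's sandbox copy of a state article
        name = _DISAMBIGUATOR.sub("", title)
        cleaned.add(DISPLAY_OVERRIDES.get(name, name))
    return sorted(cleaned)


sample = ["Alabama", "User:Beebarose/Alabama", "Georgia (U.S. state)", "Texas"]
print(clean_titles(sample))  # ['Alabama', 'Georgia', 'Texas']
```

Stripping every trailing parenthetical is a heuristic and will be wrong for some titles, so the result still wants a human look before it gets materialized as the list you actually ship.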