Paul A Houle wrote:


On Thu, Sep 17, 2009 at 7:23 AM, Kingsley Idehen <kide...@openlinksw.com> wrote:



    This is basically an aspect of the whole Linked Data meme that is
    lost on too many.


I've got to thank the book by Allemang and Hendler

http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564

for setting me straight about data modeling in RDF. RDFS and OWL are based on a system of duck typing that turns conventional object or object-relational thinking inside out. It's not necessarily good or bad, but it's really different. Even though types matter, predicates come before types because using predicate A can make object B become a member of type C, even if B is never explicitly put in class C.
Schema Last vs. Schema First :-) This is an RDF virtue that, once broadly understood across the more traditional DBMS realms, will work wonders for the appreciation of RDF-based Linked Data.
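
A minimal sketch of the "predicates come before types" point, assuming the mechanism in play is rdfs:domain / rdfs:range inference. Triples are plain Python tuples, and every ex: name below is made up for illustration; this is not how any particular RDF store implements it.

```python
# Triples are (subject, predicate, object) tuples. All ex: names are hypothetical.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"


def infer_types(triples):
    """Return the rdf:type triples implied by rdfs:domain and rdfs:range."""
    triples = set(triples)
    inferred = set()
    for prop, kind, cls in triples:
        if kind not in (RDFS_DOMAIN, RDFS_RANGE):
            continue  # not a schema statement
        for s, p, o in triples:
            if p != prop:
                continue
            # domain types the subject, range types the object
            member = s if kind == RDFS_DOMAIN else o
            inferred.add((member, RDF_TYPE, cls))
    return inferred - triples


data = {
    ("ex:teaches", RDFS_DOMAIN, "ex:Teacher"),
    ("ex:teaches", RDFS_RANGE, "ex:Course"),
    # Nobody ever says that ex:alice is a Teacher; using the predicate is enough.
    ("ex:alice", "ex:teaches", "ex:rdf101"),
}

for triple in sorted(infer_types(data)):
    print(triple)
```

Nothing here rejects data: the schema statements only ever add triples, which is the difference from a relational constraint.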

Looking at the predicates in RDFS or OWL and not understanding the whole, it's pretty easy to think "oh, this isn't too different from a relational database" and miss the point that RDFS and OWL are much more about inference (creating new triples) than they are about constraints or the physical layout of the data.
It's about a concrete conceptual layer that isn't oblivious to context. In some quarters this is actually called a Context Model Database [1].

One consequence of this is that using an existing predicate can drag in a lot more baggage than you might want; it's pretty easy to get the inference engine to infer too much, and false inferences can snowball like a katamari.
Yes, but the katamari can be confined to a specific data space that is owned and controlled by a particular person, who has a specific world view. As long as the axioms are partitioned across data spaces, and the RDF store is capable of processing within said confines, everyone is happy. Trouble starts when the claims become global facts imposed on everyone else who has access to the data space.
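
A rough sketch of what confining inference to a data space could look like, assuming each space carries its own axioms next to its own triples. The space names, the ex: vocabulary, and the deliberately dubious axiom are all hypothetical, and a real store would do this with named graphs or similar rather than Python dicts.

```python
def apply_axioms(triples, implies):
    """Single pass: add (x, B, y) for each (x, A, y) where implies[A] == B."""
    extra = {(s, implies[p], o) for s, p, o in triples if p in implies}
    return set(triples) | extra


shared_fact = ("ex:kim", "ex:bornIn", "ex:Boston")

# Each data space pairs the same fact with its owner's own world view.
spaces = {
    "alice": {
        "triples": {shared_fact},
        # A questionable axiom: being born somewhere implies living there.
        "axioms": {"ex:bornIn": "ex:livesIn"},
    },
    "bob": {
        "triples": {shared_fact},
        "axioms": {},  # bob does not subscribe to alice's axiom
    },
}

# Inference runs per space, so alice's snowball never reaches bob.
views = {
    name: apply_axioms(space["triples"], space["axioms"])
    for name, space in spaces.items()
}

print(sorted(views["alice"]))
print(sorted(views["bob"]))
```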

A lot of people are in the habit of reusing vocabularies and seem to forget that the natural answer to most RDF modeling problems is to create a new predicate. OWL has a rich set of mechanisms that can tell systems that

x A y -> x B y
where A is your new predicate and B is a well-known predicate. Once you merge two "almost-but-not-the-same" things by actually using the same predicate, it's very hard to fix the damage. If you use inference, it's easy to change your mind.
Yep! The trouble is that OWL appreciation is low, but ultimately this is where the magic really lies. This is how URIs (Data Source Names) will be distinguished: by the data highway smarts they expose. Basically: I am traveling from Boston to Detroit; which route (amongst many) gets me there quickest, based on my specific preferences?
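
The `x A y -> x B y` rule quoted above can be sketched as follows, assuming the mechanism is rdfs:subPropertyOf (one of several that RDFS and OWL offer; the thread does not say which one is meant). The ex: names are hypothetical. The point of the sketch is the last few lines: because the link between A and B is itself just a triple, dropping it and re-running the inference undoes the merge.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"


def materialize(asserted):
    """Apply x A y -> x B y for every (A, rdfs:subPropertyOf, B) until nothing new appears."""
    closure = set(asserted)
    while True:
        links = {(a, b) for a, p, b in closure if p == SUB_PROPERTY_OF}
        new = {
            (s, b, o)
            for s, p, o in closure
            for a, b in links
            if p == a
        } - closure
        if not new:
            return closure
        closure |= new


axiom = ("ex:shortName", SUB_PROPERTY_OF, "rdfs:label")
facts = {("ex:nk", "ex:shortName", "North Korea")}

# With the axiom, consumers of the well-known predicate see the data.
with_axiom = materialize(facts | {axiom})

# Changing your mind: drop the axiom and recompute. The asserted data is untouched.
without_axiom = materialize(facts)

print(("ex:nk", "rdfs:label", "North Korea") in with_axiom)
print(("ex:nk", "rdfs:label", "North Korea") in without_axiom)
```

Had ex:shortName values been written directly as rdfs:label, there would be no record of which labels came from where, and no axiom to retract.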

--------------

It may be different with other data sets, but data cleaning is absolutely essential when working with DBpedia if you want to make production-quality systems.
Data cleansing is required because there are no absolute truths and we all see the same thing differently. What RDF facilitates, above all else, is protection of our natural tendencies (seeing the same things differently) by inverting the traditional model, where different views or perspectives introduce inertia.

Heterogeneity is the spice of life for a reason. Even our DNA rewards us when we fuse afar (rather than inbreed) etc. :-)

For instance, people build bizapps all the time and they need a list of US states... Usually we go and cut and paste one from somewhere... But now I've got DBpedia and I should be able to do this systematically. There's a category in Wikipedia for that...

http://en.wikipedia.org/wiki/Category:States_of_the_United_States

if you ignore the subcategories and just take the actual pages, it's (almost) what you need, except for some weirdos like

User:Beebarose/Alabama <http://en.wikipedia.org/wiki/User:Beebarose/Alabama>

and one state that's got a disambiguator in the name:

Georgia (U.S. state) <http://en.wikipedia.org/wiki/Georgia_%28U.S._state%29>

It's not hard to clean up this list, but it takes some effort, and ultimately you're probably going to materialize something new.
Yes, something new, in a new data space that is still plugged into the Web.
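
A small sketch of the cleanup described above, assuming the category members have already been fetched as a list of page titles. The sample list, the namespace prefixes, and the function name are illustrative only; the two rules are the ones named in the thread (drop non-article pages such as User: pages, strip a trailing parenthetical disambiguator).

```python
import re

# Assumed set of namespace prefixes to reject; extend as needed.
NAMESPACE_PREFIXES = ("User:", "Category:", "Template:", "Talk:")

# A trailing parenthetical such as " (U.S. state)".
DISAMBIGUATOR = re.compile(r"\s*\([^)]*\)$")


def clean_titles(titles):
    """Drop non-article pages and strip trailing disambiguators."""
    cleaned = set()
    for title in titles:
        if title.startswith(NAMESPACE_PREFIXES):
            continue  # e.g. a user's sandbox copy of an article
        cleaned.add(DISAMBIGUATOR.sub("", title))
    return sorted(cleaned)


# Hypothetical sample of category members, not a real query result.
raw = ["Alabama", "User:Beebarose/Alabama", "Georgia (U.S. state)", "Ohio"]

print(clean_titles(raw))
```

The output is a new, materialized list; the rules that produced it are worth keeping alongside it so the list can be regenerated when the category changes.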

These sorts of issues turn up even in highly clean data sets. I once built a webapp that used a list of countries to draw a dropdown list, but the dropdown was excessively wide and busted the layout of the site. The list was that wide because a few authoritarian countries have long and flowery names. The transformation

Democratic People's Republic of Korea -> North Korea

improved the usability of the site while eliminating Orwellian language. This kind of "fit and finish" is needed to make quality sites, and semweb systems are going to need automated and manual ways of fixing this so that "Web 3.0" looks like a step forward, not a step back.
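
One way to combine the "automated and manual" fixes mentioned above, sketched under assumptions: a hand-maintained override table supplies short display names, and an automated check flags whatever is still too wide for the dropdown. The function names, the 25-character limit, and the second sample name are hypothetical; only the North Korea mapping comes from the thread.

```python
# Manual part: curated display names, keyed by the official name.
DISPLAY_OVERRIDES = {
    "Democratic People's Republic of Korea": "North Korea",
}


def display_name(official, overrides=DISPLAY_OVERRIDES):
    """Return the curated short name if there is one, else the official name."""
    return overrides.get(official, official)


def too_long(officials, max_len, overrides=DISPLAY_OVERRIDES):
    """Automated part: list official names whose display name still exceeds max_len."""
    return [
        name
        for name in officials
        if len(display_name(name, overrides)) > max_len
    ]


countries = [
    "France",
    "Democratic People's Republic of Korea",
    "Lao People's Democratic Republic",
]

print([display_name(c) for c in countries])
print(too_long(countries, max_len=25))
```

The official name stays the key, so the underlying data is untouched and the override is only a presentation decision.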

Web 3.0 is a step forward, but we need to know where the step is :-) As you know, it ain't about code; it's about data structures combined with ubiquitous access and reference.


--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software Web: http://www.openlinksw.com




