Re: Where to put the knowledge you add

Kingsley Idehen Wed, 12 Oct 2011 12:40:18 -0700

On 10/12/11 3:08 PM, Hugh Glaser wrote:

Hi Kingsley.
My point in the post is that it is not about removing errant triples.


I didn't imply or infer that's the sole solution.

The hypothesis is that none of these triples that are not derived from the 
Wikipedia content should be in the dbpedia.org endpoint or returned as Linked 
Data from URIs.

And as I've stated repeatedly, these records (triples) shouldn't be in the <http://dbpedia.org> named partition (graph IRI) in the host Virtuoso DBMS. It just so happened that our project colleague sent a bundle to be loaded into the public instance. It got through our check due to the label "3.7 dataset for public instance". As a rule, one that's pretty hardcore inside OpenLink, datasets are pegged to origin via Graph IRIs.

If you remove the double negative :-) you get:
The dbpedia.org endpoint and associated Linked Data should be Wikipedia-derived
data only.
And in this case that would fix the problem I raised.

We are saying the same thing. No different to any comment I've ever made to you about Virtuoso and its approach to Named Graphs.


My suggestions are:

1. NYT dataset should not be in the <http://dbpedia.org> named graph, it should be in its own partition within the Virtuoso instance (as per our own best practices in this area)


2. NYT should fix their dataset, irrespective

3. Our colleagues at Freie need to remember that <http://dbpedia.org> graph IRI is for data sets derived from Wikipedia, solely -- the point you are making too.

I hope this clears up this matter. Removing the existing triples from the public instance is part of the solution, since we don't really need to reload the whole thing. SPARQL Update enables this kind of cleanup :-)
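
A minimal sketch of that kind of cleanup, written as a standard SPARQL 1.1 Update request. It assumes the errant links are owl:sameAs triples whose subject is an NYT URI, and it uses <http://data.nytimes.com> as a hypothetical target graph IRI. Neither assumption is confirmed in this thread, so treat it as an illustration rather than the statement actually run against the public instance:

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Move NYT-originated owl:sameAs links out of the DBpedia graph and into
# a graph of their own. The target graph IRI is an assumption.
DELETE { GRAPH <http://dbpedia.org>      { ?s owl:sameAs ?o } }
INSERT { GRAPH <http://data.nytimes.com> { ?s owl:sameAs ?o } }
WHERE {
  GRAPH <http://dbpedia.org> { ?s owl:sameAs ?o }
  # Assumed selection rule: the subject is an NYT URI.
  FILTER ( STRSTARTS(STR(?s), "http://data.nytimes.com/") )
}
```

Dropping the INSERT clause would delete the triples outright instead of relocating them.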



Kingsley

Best
Hugh

On 12 Oct 2011, at 14:32, Kingsley Idehen wrote:

On 10/12/11 8:49 AM, glenn mcdonald wrote:

I agree with this entirely, and it's why I keep insisting that for most purposes datasets should be 
expressed using local identifiers, with all external linkages called out explicitly and/or 
externally. owl:sameAs and the use of other people's identifiers for your own nodes are equally 
dangerous. If I'm asserting that Brussels is the capital of Belgium, I'm saying that my notion of 
Brussels is my notion of "capital" of my notion of Belgium. I am the authority for that 
assertion. Saying that my notion of Brussels, "capital" or Belgium correspond with 
anybody else's notion of anything are separate assertions, for which I do not have the same 
authority.

For that matter, the proper interpretation of "correspond" depends on the purpose: for some things, 
treating "correspond" as owl:sameAs may be exactly right, and for some it might be utterly 
unacceptable. And it's much easier to map a "corresponds" property to owl:sameAs if you want to 
than to rewrite an entire dataset to undo the misapplication of IDs or owl:sameAs.

Think global, assert local.

Glenn / Hugh,

A data space admin (or authorized curator) can think globally and assert
locally, in the realm of Linked Data by doing the following:

1. Partition Datasets by Named Graph IRI
2. Make the main Dataset (e.g. DBpedia) the default graph for a given Linked
Data Space (e.g. DBpedia and its SPARQL endpoint).
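
As a rough illustration of what partitioning by Named Graph IRI buys you, the query below asks which graph each owl:sameAs link to Belgium actually lives in. It is only a sketch: which graph IRIs come back depends entirely on how the store was loaded.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# For each named graph, list the nodes asserted to be owl:sameAs Belgium.
SELECT DISTINCT ?g ?s
WHERE {
  GRAPH ?g {
    ?s owl:sameAs <http://dbpedia.org/resource/Belgium> .
  }
}
ORDER BY ?g
```

If datasets are pegged to their origin, links contributed by a third party show up under that party's graph IRI rather than under <http://dbpedia.org>.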

What went wrong here?

Hugh: yesterday, in our private exchange, I indicated to you that we (OpenLink
Software) loaded the NYT dataset into the <http://dbpedia.org> Graph IRI, which
is also the default graph of the DBpedia SPARQL endpoint. It should have been
loaded into its own Named Graph with its own Graph IRI. After further investigation,
that account turned out not to be 100% accurate. Here's what happened, and it boils
down to confusion about what constitutes the DBpedia 3.7 dataset:

1. http://wiki.dbpedia.org/Downloads37 -- there are many datasets on that page, but
we loaded the lot (as has been the case in the past) into the graph
IRI <http://dbpedia.org>

2. Then when I ran my simple check via:
http://dbpedia.org/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdata.nytimes.com%2F60370132632367982721
-- I assumed an errant load, which isn't the case since the NYT dataset was part
of the post-final-qa payload (a tarball from our colleagues at Freie) which we loaded
into the <http://dbpedia.org> graph.

Fix options:

1. We can easily remove the errant triples -- but we will need a list so we do
this one time

2. Get NYT to fix their dataset once and for all, otherwise it will be
quarantined in its own named graph and we'll keep a marker in place on it re.
future loads until it's fixed.

Glenn:
Now, if you go through the archives of this mailing list, you'll see earlier
posts where I pointed out this pattern to Hugh (maybe a year or two ago). As
is really the case most of the time, your concerns are factored into what we
do, I just need to find the right language for articulating that to you :-)

Kingsley

glenn

On Wed, Oct 12, 2011 at 7:55 AM, Hugh Glaser <[email protected]> wrote:

Hi.

I have argued for a long time that the linkage data (in particular owl:sameAs
and similar links) should not usually be mixed with the knowledge being
published.

Thus, for example as I discussed with Evan for the NYTimes site a while ago, it
is not a good thing to put the owl:sameAs links (which were produced by a
relatively unskilled individual over a short period of time) at the same status
as the other data, which has been curated over decades by expert reporters.

These sameAs links have potentially very different trust, provenance, licence,
and possibly other non-functional attributes from the substantive data.
Clearly they have different trust and provenance, but licence may well be
different, as the NYT may want people to take the triples away to bring traffic
to their site, while keeping the other triples under more restricted licence.

Which brings me to an example of where things have recently gone badly wrong.
I have reported a bug to the dbpedia team wherein the URIs for countries have
become deeply intertwingled.
Example queries are at the end of this message - they have to explicitly do the
owl:sameAs because the store does not do owl:sameAs inference, but the outcome is that I
can validly infer answers such as "Maseru is the capital of Belgium".

Of course, mistakes happen, so I am not having a specific go at dbpedia, which
I still think is wonderful.

But the outcome is that I get very bad data from dbpedia.org unexpectedly,
which means I (and presumably anyone else) can't reliably use dbpedia.org at
all (because I use an inference engine when I cache the data).
Had the dbpedia.org site simply stuck to the behaviour I was sort of expecting
of publishing data from wikipedia (possibly publishing the linkage data
elsewhere) I would have been in a better position.

One of the issues here is to realise when we are actually adding knowledge to a
triplification process.
It is clear when things like owl:sameAs are added that knowledge is being added.
However, people probably consider it less clear if URIs from dbpedia or
elsewhere are directly used that they are adding their own knowledge.
In a similar way, such use introduces knowledge which may have very different
trust and provenance from the data being triplified.

Is this a good way to do things?

I would say not.
I have used a wide variety of Linked Data sources, and have found problems with
almost every one of them (possibly every significant one).
The problems frequently relate to the extra knowledge that the triplification
process has introduced.
If only I could be given the data without, then I would not have to reject the
dataset.

Thanks for reading this far.
Best
Hugh

Query:
SELECT DISTINCT ?capital WHERE {
  ?s owl:sameAs <http://dbpedia.org/resource/Belgium> .
  ?s owl:sameAs ?country .
  ?country <http://dbpedia.org/ontology/capital> ?capital .
}

As a URI:
http://dbpedia.org/snorql/?query=SELECT+DISTINCT+%3Fcapital+WHERE+%7B%0D%0A+%3Fs+owl%3AsameAs+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FBelgium%3E+.%0D%0A+%3Fs+owl%3AsameAs+%3Fcountry+.%0D%0A+%3Fcountry+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2Fcapital%3E+%3Fcapital+.%0D%0A%7D%0D%0A

Output:
capital
http://dbpedia.org/resource/City_of_Brussels
http://dbpedia.org/resource/Maseru
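
A possible follow-up diagnostic, sketched under the assumption that countries are typed with the DBpedia ontology class dbo:Country: it lists every node that is declared owl:sameAs more than one country, which is the pattern that lets Maseru leak into the result above.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Nodes linked by owl:sameAs to two or more distinct countries.
SELECT ?s (COUNT(DISTINCT ?country) AS ?countries)
WHERE {
  ?s owl:sameAs ?country .
  ?country a dbo:Country .
}
GROUP BY ?s
HAVING (COUNT(DISTINCT ?country) > 1)
ORDER BY DESC(?countries)
```

Any subject returned here is a candidate for the list of errant triples, though each hit would still need checking by hand.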

--
Hugh Glaser,
Web and Internet Science
Electronics and Computer Science,
University of Southampton,
Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/


--

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web:
http://www.openlinksw.com

Weblog:
http://www.openlinksw.com/blog/~kidehen

Twitter/Identi.ca: kidehen


