Hugh,
As so often, you are right (about my sloppy usage of the term
"publisher"), and I think your analysis below is indeed close to what I
was thinking as well. Let's move over to the ESW Wiki and write up
stuff. A paste from your email might be a good start! Mind minting a URI
for it and starting to fill in the Wiki page? I'm travelling and limited
in my capabilities currently ;)
Cheers, Michael
Sent from my iPhone
On 14 Feb 2009, at 16:00, "Hugh Glaser" <[email protected]> wrote:
Hi Michael.
I got thoroughly confused, I think, by your use of "the dataset
publisher (the authoritative one who 'owns' it)".
That made me think you were talking about the owner of the broken URI
(ie, where it should have resolved to), rather than the place that gave
you the URI. (Which was it? :-) )
So the next bit is the first of those:
======================================
I think in a lot of the LOD world, a 404 means "I don't know anything
about that URI", rather than a broken link.
Certainly for us, that is all we can do.
In fact, what we are actually doing is manually generating the 404 when
we find there is nothing in the KB; we could instead return a blankish
RDF document, but that didn't seem sensible.
Now I think about it, I have checked what dbpedia does for
http://dbpedia.org/resource/Esperanta: it does the blank doc thing.
(I guess we need to work out what is best practice for this and then add
it to the How to Publish document. I think my view is that something like
http://dbpedia.org/data/Esperanta.rdf should 404.)
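To make the two options concrete, here is a minimal sketch (illustrative Python, not actual RKB or dbpedia code; the KB, URIs, and function names are all invented) of a handler that either 404s on an unknown URI or, dbpedia-style, serves a blank-ish RDF document:

```python
# A blank-ish RDF/XML document, as dbpedia appears to serve for unknown URIs.
EMPTY_RDF = (
    '<?xml version="1.0"?>\n'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>\n'
)

# Stub knowledge base: maps known URIs to their serialized descriptions.
KB = {
    "http://example.org/id/alice": "<rdf:RDF>...triples about alice...</rdf:RDF>",
}

def triples_for(uri):
    """Stub KB lookup: serialized RDF for a known URI, or None if unknown."""
    return KB.get(uri)

def describe(uri, unknown_means_404=True):
    """Return (status, body). With unknown_means_404=False, behave like
    dbpedia and serve a blank-ish RDF document for unknown URIs instead."""
    body = triples_for(uri)
    if body is not None:
        return 200, body
    if unknown_means_404:
        return 404, "Nothing known about " + uri
    return 200, EMPTY_RDF
```

Either way, the client cannot tell a "rubbish" URI from a once-valid one; the flag only changes which of the two ambiguous answers it gets.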
So either way, in LOD sites of the sort that have DBs or KBs behind
them, either it is not possible to get a 404 (dbpedia), or you can't
distinguish between a rubbish URI that might have been generated and one
you want to know about.
I find the idea that I might give people the expectation that I will
create triples (as in your point 2) rather strange - if I knew triples I
would have served them in the first place. Of course, if we consider a
URI I don't know as a request for me to go and find knowledge about it,
fair enough, but I would expect a more explicit service for that. In
that sense it would not be a "broken link".
Maybe the world is different for the other RDFa etc ways of
publishing LD,
but in the DB/KB world, I don't see broken incoming links as
something that
can be usefully dealt with, other than the maintainer checking what is
happening, as you do with a normal site.
======================================
Now turning to the second possible meaning.
We are concerned with the place that gave you the URI, which is
possibly
more interesting. And I think this is actually the case for your TAG
example.
If I gave you (by which I mean an agent) such a link and you
discovered it
was broken, it would be helpful to me and the LOD world if you could
tell me
about it, so I could fix it. In fact it would also be helpful if you
had a
suggestion as to the fix (ie a better URI), which is not out of the
question. And if I trust you (when we understand what that means), I
might
even do a replacement or some equivalent triples without further
intervention.
======================================
In the case of our RKB system, we actually do something like this
already.
If we find that there is nothing about a URI in the KB that should have
it, we don't immediately return 404, but look it up in the associated
CRS (coreference service), and possibly others, to see if there is an
equivalent URI in the same KB that could be used (we do not return RDF
from other KBs, although we could). So if you try to resolve
http://southampton.rkbexplorer.com/description/person-07113
you actually get the data for
http://southampton.rkbexplorer.com/id/person-0a36cf76d1a3e99f9267ce3d0b95e42e-06999d58799cb8a3a55d3c69efcc9ba6
and a message telling you to use the new one next time.
(I'm not sure we have got the RDF perfectly right, but that is the
idea.)
In effect, if we are asked for a broken link, we have a quick look
around to
see if there is anything we do know, and give that back.
Of course, the CRS also gives the requestor the chance to do the
same fixing
up.
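The fallback described above might look roughly like this (an illustrative Python sketch; the KB and CRS are stub dictionaries and every name and URI is invented, since the email does not show the actual RKB implementation):

```python
# Stub knowledge base: only the non-deprecated URI still has triples.
KB = {
    "http://example.org/id/person-B": "<rdf:RDF>...data about person-B...</rdf:RDF>",
}

# Stub coreference service (CRS): maps deprecated URIs to their
# non-deprecated equivalents in the same KB.
CRS = {
    "http://example.org/id/person-A": "http://example.org/id/person-B",
}

def resolve(uri):
    """Return (status, canonical_uri, body). On a miss, have a quick look
    around in the CRS for an equivalent URI before giving up with a 404."""
    if uri in KB:
        return 200, uri, KB[uri]
    equivalent = CRS.get(uri)
    if equivalent is not None and equivalent in KB:
        # Serve the equivalent's data; the canonical URI in the response
        # doubles as the "use this one next time" message.
        return 200, equivalent, KB[equivalent]
    return 404, uri, None
```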
The reason that there might be a URI that has no triples in the KB, but
that we know about, is because we "deprecate" URIs to reduce the number,
and then use the CRS to resolve from deprecated to non-deprecated.
So a deprecated URI is one we used to know about, and that may still be
being used "out there", but that we don't want to continue to use - sort
of a broken link.
Hence our dynamic broken link fixing.
Best
Hugh
PS.
My choice of http://dbpedia.org/data/Esperanta.rdf as a misspelling of
http://dbpedia.org/data/Esperanto.rdf turned out to be fascinating.
It turns out that Wikipedia tells me that there used to be a page
http://en.wikipedia.org/wiki/Esperanta, but it has been deleted.
So what is returned is different from
http://en.wikipedia.org/wiki/Esperanti.
Although http://dbpedia.org/data/Esperanta.rdf and
http://dbpedia.org/data/Esperanti.rdf both return empty RDF documents, I
think.
It looks to me that this is trying to solve a similar problem to the one
our deprecated URIs are solving in our CRS.
On 14/02/2009 14:06, "Hausenblas, Michael" <[email protected]> wrote:
Kingsley,
Grounding in 404 and 30x makes sense to me. However I am still in the
conception phase ;)
Sent from my iPhone
On 12 Feb 2009, at 14:02, "Kingsley Idehen"
<[email protected]> wrote:
Michael Hausenblas wrote:
Bernhard, All,
So, another take on how to deal with broken links: a couple of days ago
I reported two broken links in a TAG finding [1], which was (quickly and
pragmatically, bravo, TAG!) addressed [2] recently.
Let's abstract this away and apply it to data rather than documents. The
mechanism could work as follows:
1. A *human* (e.g. through a built-in feature in a Web of Data browser
such as Tabulator) encounters a broken link and reports it to the
respective dataset publisher (the authoritative one who 'owns' it)
OR
1. A machine encounters a broken link (should it then directly ping the
dataset publisher, or first 'ask' its master for permission?)
2. The dataset publisher acknowledges the broken link and creates the
corresponding triples, as done in the case for documents (cf. [2])
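Step 1 above could be sketched like this (illustrative Python; the report shape, agent URI, and function names are all invented for the example, and the fetcher is injected as a stub so nothing touches the network):

```python
def check_link(uri, fetch, reporter="http://example.org/agent"):
    """Check a link with the supplied fetcher (returns an HTTP status code).
    Return None if the link resolves, or a broken-link report - the kind of
    thing an agent could send to the dataset publisher - otherwise."""
    status = fetch(uri)
    if status < 400:
        return None
    return {
        "broken_uri": uri,
        "status": status,
        "reported_by": reporter,
    }

def stub_fetch(uri):
    """Stand-in for an HTTP HEAD: pretend everything under /dead/ is gone."""
    return 404 if "/dead/" in uri else 200
```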
In case anyone wants to pick that up, I'm happy to contribute.
The name?
Well, a straw-man proposal could be called *re*pairing *vi*ntage
link
*val*ues (REVIVAL) - anyone? :)
Cheers,
Michael
[1] http://lists.w3.org/Archives/Public/www-tag/2009Jan/0118.html
[2] http://lists.w3.org/Archives/Public/www-tag/2009Feb/0068.html
Michael,
If the publisher is truly dog-fooding and they know what data objects
they are publishing, condition 404 should be the trigger for a
self-directed query to determine:
1. what's happened to the entity URI
2. lookup similar entities
3. then self-fix if possible (e.g. a 302)
Basically, Linked Data publishers should make 404s another Linked Data
prowess exploitation point :-)
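That 404-triggered self-fix might be sketched as follows (illustrative Python; the lookup tables, URIs, and function names are stubs for the example, not anything from an actual publisher):

```python
# Stub successor table: what's happened to deprecated/moved entity URIs.
MOVED = {"http://example.org/id/old-widget": "http://example.org/id/widget"}

# Stub set of URIs the publisher still serves data for.
KNOWN = {"http://example.org/id/widget"}

def handle(uri):
    """Return (status, location). Instead of a bare 404, first run the
    self-directed query: if a successor entity exists, self-fix with a 302."""
    if uri in KNOWN:
        return 200, None
    successor = MOVED.get(uri)
    if successor in KNOWN:
        return 302, successor  # self-fix: redirect to the successor URI
    return 404, None
```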
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com