On 9/12/14 4:58 PM, Jörn Hees wrote:
On 12 Sep 2014, at 01:26, Kingsley Idehen <kide...@openlinksw.com> wrote:

Good place to report these matters. Bottom line, the New York Times Linked Data 
is problematic. They should be using foaf:focus where they currently use 
owl:sameAs.

I did fix this in the last DBpedia instance, via SPARQL 1.1 
forward-chaining. I guess I need to make time to repeat the fix.

DBpedia Team: we need to perform this step next time around, if the New York 
Times refuses to make this important correction.

Alternatively, you can fix the dump too. Either way, this is a problem that we 
should fix.

I think it's a better idea to fix this in the dumps than only on one endpoint.

Of course.

My point is that once it's fixed in the Virtuoso DBMS behind the endpoint, we can then make a dump, which becomes the replacement dataset for future efforts.
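
Roughly, the rewrite involved looks like the following (just a sketch; the actual query is at [1] below, and the graph IRI here is only illustrative):

PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Rewrite the NYT -> DBpedia owl:sameAs links as foaf:focus links.
WITH <http://data.nytimes.com>
DELETE { ?nyt owl:sameAs ?dbp }
INSERT { ?nyt foaf:focus ?dbp }
WHERE  {
  ?nyt owl:sameAs ?dbp .
  FILTER ( STRSTARTS(STR(?nyt), "http://data.nytimes.com/")
        && STRSTARTS(STR(?dbp), "http://dbpedia.org/resource/") )
}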

Links:

[1] http://kingsley.idehen.net/public_home/kidehen/Public/SPARQL-CRUD/nyt_dbpedia_mappings_fix.rq -- SPARQL 1.1 fix

Kingsley

I assume the wrong info is coming from the nytimes_links.nt.gz dump file (9678 
lines).

These are the data.nytimes.com URIs that occur twice, each linking two different 
things with owl:sameAs:
(I know it's a bit dirty, but the data.nytimes.com subject URIs are shorter than 
47 chars, and the 2nd column is long enough that the 47-char prefix comparison never reaches the 3rd column):
$ zcat nytimes_links.nt.gz | sort | uniq -D -w 47 | less
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Harlem> .
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Woods_Hole,_Massachusetts> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Colombia> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/St._Louis,_Missouri> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bar_Harbor,_Maine> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Vancouver> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Montenegro> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Orlando,_Florida> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ann_Arbor,_Michigan> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bucharest> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Brisbane> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Timbuktu> .
...
1102 lines

File here (18 KB):  http://www.dfki.de/~hees/nytimes_links_dups.nt.gz

Did some quick stats: each of those URIs links exactly 2 things, so we have 551 
of them that are problematic:

$ zcat nytimes_links.nt.gz | sort | uniq | cut -d' ' -f1 | sort | uniq -d | less
<http://data.nytimes.com/10037152102685288131>
<http://data.nytimes.com/10219323006478270621>
<http://data.nytimes.com/10943489202025116191>
<http://data.nytimes.com/11974025787996384181>
<http://data.nytimes.com/13330280224726436521>
<http://data.nytimes.com/14192138827082289301>
...
551 lines
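
A quick sanity check on the "exactly 2" claim (same file; an empty result is what we expect, since any output here would mean some subject maps to more than two DBpedia resources):
$ zcat nytimes_links.nt.gz | sort | uniq | cut -d' ' -f1 | sort | uniq -c | awk '$1 > 2'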


This only leaves the lines whose 47-char subject prefix isn't duplicated:
$ zcat nytimes_links.nt.gz | sort | uniq -u -w 47 | less
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10028178420088332933> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/F._Lee_Bailey> .
<http://data.nytimes.com/10040729966879859333> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Grace_Paley> .
<http://data.nytimes.com/10054942171853816843> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Jesse_McKinley> .
...
8576 lines


File here (194 KB): http://www.dfki.de/~hees/nytimes_links_dups_pruned.nt.gz


I'm not sure about the rest of that file though, given that nearly 1/10th of it 
was obviously wrong...


Cheers,
Jörn





--
Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this


