Hi,

I'm currently struggling with the DBpedia 3.7 dumps, but as DBpedia 3.8 
seems to be on its way, I thought I'd let you know about some of the problems 
I have encountered so far, which make the dumps tricky to work with.

I downloaded the 3.7 all_languages.tar, all_languages-i18n.tar and 
interlang-i18n.tar.

The problems I encountered are mainly related to the i18n dataset, but can also 
be found in the interlanguage links of the "traditional" dump (in 
all_languages.tar):

It seems as if the de, el and ru Wikipedias were exported differently from the 
other languages: their encodings differ and they have the language prefix in 
the URIs (de.dbpedia.org). They use UTF-8 IRIs in the dump files, while all 
other languages I tried use %-escaped URIs and don't have a language-prefixed 
URI.

This leads to a couple of encoding issues (.nt files are ASCII only and 
normally use % encoding, but de, el and ru contain UTF-8 here) and 
inconsistencies (e.g., interlanguage links to fr.dbpedia.org pointing to 
nothing).
The interlanguage link files also mix different encodings within the same 
file, which makes them tricky to load.


Details showing where these issues occur (all of the following relates to the 
3.7-i18n dumps!):
=========================================================================

File Encoding:
Mainly the de, el and ru dumps, as well as the interlanguage_link dumps, 
contain .nt files with UTF-8 encoded IRIs. This isn't valid .nt, so maybe 
consider renaming all those files to .n3?
Example:
from de/labels_de.nt.bz2:
<http://de.dbpedia.org/resource/Anschlussfähigkeit> 
<http://www.w3.org/2000/01/rdf-schema#label> "Anschlussf\u00E4higkeit"@de .
from fr/labels_fr.nt.bz2:
<http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> 
<http://www.w3.org/2000/01/rdf-schema#label> 
"Alg\u00E8bre g\u00E9n\u00E9rale"@fr .
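
A quick way to find the affected files is to scan a dump for bytes outside the 
ASCII range. The following is only a minimal sketch, assuming Python 3; the 
function names are made up for illustration, and the sample lines are the two 
label triples quoted above, joined onto one line each:

```python
import bz2


def non_ascii_line_numbers(lines):
    """Return the 1-based numbers of lines containing any byte above 0x7F."""
    return [n for n, raw in enumerate(lines, start=1)
            if any(byte > 0x7F for byte in raw)]


def check_dump(path):
    """Scan a .bz2 dump file; binary mode yields raw, undecoded lines."""
    with bz2.open(path, "rb") as handle:
        return non_ascii_line_numbers(handle)


# In-memory sample (raw strings keep the \uXXXX escapes as plain ASCII text).
sample = [
    (r'<http://de.dbpedia.org/resource/Anschlussfähigkeit> '
     r'<http://www.w3.org/2000/01/rdf-schema#label> '
     r'"Anschlussf\u00E4higkeit"@de .').encode("utf-8"),
    (r'<http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> '
     r'<http://www.w3.org/2000/01/rdf-schema#label> '
     r'"Alg\u00E8bre g\u00E9n\u00E9rale"@fr .').encode("utf-8"),
]

print(non_ascii_line_numbers(sample))
```

Only the de line is reported, because its subject IRI contains a raw UTF-8 
"ä"; the fr line is pure ASCII.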

Encoding & escaping:
The de, el and ru IRIs seem to be UTF-8 encoded, with parentheses left 
unescaped. All other languages I tried seem to use %-encoded URIs and escape 
parentheses as %28 / %29.
Example:
from interlang-i18n/el/interlanguage_links_el.nt.bz2
<http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs> 
<http://de.dbpedia.org/resource/Tau_(Buchstabe)> .
<http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs> 
<http://dbpedia.org/resource/Tau> .
<http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs> 
<http://nl.dbpedia.org/resource/Tau_%28letter%29> .
<http://el.dbpedia.org/resource/Ψι> <http://www.w3.org/2002/07/owl#sameAs> 
<http://fr.dbpedia.org/resource/Psi_%28lettre_grecque%29> .
<http://el.dbpedia.org/resource/Ψι> <http://www.w3.org/2002/07/owl#sameAs> 
<http://ru.dbpedia.org/resource/Пси_(буква)> .
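
As a workaround while loading, the two forms can be brought into one by 
percent-encoding the path of every IRI. This is just a sketch, assuming 
Python 3 and its standard urllib.parse module; the function name is made up, 
and the exact set of characters the DBpedia extraction leaves unescaped is an 
assumption on my side (here everything except unreserved characters and "/" 
gets escaped):

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_percent_encoded(iri):
    """Return the IRI with its path percent-encoded as UTF-8 bytes."""
    parts = urlsplit(iri)
    # Unquote first so that already-escaped input is not escaped twice.
    # Caveat: a title containing a literal '%' would be ambiguous here.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


print(to_percent_encoded("http://de.dbpedia.org/resource/Tau_(Buchstabe)"))
print(to_percent_encoded("http://el.dbpedia.org/resource/Ταυ"))
print(to_percent_encoded("http://nl.dbpedia.org/resource/Tau_%28letter%29"))
```

Input that is already escaped, like the nl link, passes through unchanged, so 
the function can be applied to every IRI in a mixed file.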

Inter language links:
en and de link only to each other and to el.
(minor: interlanguage_links_en.n3.bz2 seems to have 3082 triples of this form: 
?s owl:sameAs ?s)
All other languages seem to have owl:sameAs links to all languages.
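
The self-referential triples mentioned above can be spotted with a simple 
line filter. A minimal sketch, assuming Python 3 and one triple per line with 
IRIs in subject and object position (the regex is not a full N-Triples parser, 
and the names are made up for illustration):

```python
import re

# Matches "<s> <p> <o> ." with IRIs only; literals and blank nodes are ignored.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def is_self_link(line):
    """True if the line is an owl:sameAs triple whose subject equals its object."""
    match = TRIPLE.match(line)
    return (match is not None
            and match.group(2) == SAME_AS
            and match.group(1) == match.group(3))


lines = [
    "<http://dbpedia.org/resource/Tau> <http://www.w3.org/2002/07/owl#sameAs> "
    "<http://dbpedia.org/resource/Tau> .",
    "<http://dbpedia.org/resource/Tau> <http://www.w3.org/2002/07/owl#sameAs> "
    "<http://de.dbpedia.org/resource/Tau_(Buchstabe)> .",
]

print(sum(is_self_link(line) for line in lines))
```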

Prefixing dbpedia.org with language codes:
de, el and ru files contain URIs like de.dbpedia.org, el.dbpedia.org, 
ru.dbpedia.org.
fr, es, nl, ... (all others I tried) lack the prefixes in all data files I 
tried, but the interlanguage_links_en.n3 shows them.
As a result, all interlanguage links pointing to http://fr.dbpedia.org/... 
point to nothing, as the data is at http://dbpedia.org/...
Example:
from fr/labels_fr.nt.bz2:
<http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> 
<http://www.w3.org/2000/01/rdf-schema#label> 
"Alg\u00E8bre g\u00E9n\u00E9rale"@fr .
from interlang-i18n/fr/interlanguage_links_fr.nt.bz2:
<http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> 
<http://www.w3.org/2002/07/owl#sameAs> 
<http://de.dbpedia.org/resource/Abstrakte_Algebra> .
<http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> 
<http://www.w3.org/2002/07/owl#sameAs> 
<http://dbpedia.org/resource/Abstract_algebra> .
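
Until this is fixed in the dumps, one possible workaround is to rewrite the 
link files at load time so that they match the data files. The sketch below 
assumes Python 3 and assumes that only de, el and ru data actually live under 
a language-prefixed host, which is what I observed but have not checked for 
every language; the function name and the "keep" parameter are made up:

```python
import re

# Matches a language-prefixed host such as http://fr.dbpedia.org/
LANG_HOST = re.compile(r'^http://([a-z][a-z-]*)\.dbpedia\.org/')


def strip_language_prefix(iri, keep=("de", "el", "ru")):
    """Map http://xx.dbpedia.org/... to http://dbpedia.org/... unless xx is kept."""
    match = LANG_HOST.match(iri)
    if match is None or match.group(1) in keep:
        return iri
    return "http://dbpedia.org/" + iri[match.end():]


print(strip_language_prefix(
    "http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale"))
print(strip_language_prefix(
    "http://de.dbpedia.org/resource/Abstrakte_Algebra"))
```

Note that this puts all non-prefixed languages into the same namespace as en, 
so it only mirrors what the 3.7 data files already do; it does not solve the 
underlying inconsistency.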

I'd happily lend a hand to help fix these issues in 3.8.

Cheers,
Jörn

PS: yes, I just wanted to update 
http://joernhees.de/blog/2010/10/31/setting-up-a-local-dbpedia-mirror-with-virtuoso/

