Hi,

I'm currently struggling with the DBpedia 3.7 dumps, but as DBpedia 3.8 seems to be on its way, I thought I'd let you know about some of the problems I have encountered so far, which make the dumps tricky to work with.
I downloaded the 3.7 all_languages.tar, the all_languages-i18n.tar and the interlang-i18n.tar. The problems I encountered are mainly related to the i18n dataset, but can also be found in the interlanguage links of the "traditional" dump (in the all_languages.tar).

It seems as if the de, el and ru Wikipedias were exported in a different way from the other languages, as their encodings are different and they have the language prefix in the URIs (de.dbpedia.org). They use UTF-8 IRIs in the dump files, while all other languages I tried use %-escaped URIs and don't have a language-prefixed URI. This leads to a couple of encoding issues (.nt files are ASCII only and normally use % encoding, but de, el and ru contain UTF-8 here) and inconsistencies (e.g., interlanguage links to fr.dbpedia.org pointing to nothing). The interlanguage link files also mix different ways of encoding within the same file, making them tricky to load.

Details where you can see these issues (the following all relates to the 3.7-i18n dumps!):

=========================================================================

File Encoding:
Mainly the de, el and ru dumps, as well as the interlanguage link dumps, contain .nt files with UTF-8 encoding (for IRIs). This isn't valid .nt, so maybe consider renaming all those files to .n3?

Example:
from de/labels_de.nt.bz2:
<http://de.dbpedia.org/resource/Anschlussfähigkeit> <http://www.w3.org/2000/01/rdf-schema#label> "Anschlussf\u00E4higkeit"@de .

from fr/labels_fr.nt.bz2:
<http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> <http://www.w3.org/2000/01/rdf-schema#label> "Alg\u00E8bre g\u00E9n\u00E9rale"@fr .

Encoding & escaping:
The de, el and ru IRIs seem to be UTF-8 encoded, and parentheses () seem to be left unescaped. All other languages I tried seem to use %-encoded URIs and escape the parentheses as %28 / %29.
Example:
from interlang-i18n/el/interlanguage_links_el.nt.bz2:
<http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs> <http://de.dbpedia.org/resource/Tau_(Buchstabe)> .
<http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Tau> .
<http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Tau_%28letter%29> .
<http://el.dbpedia.org/resource/Ψι> <http://www.w3.org/2002/07/owl#sameAs> <http://fr.dbpedia.org/resource/Psi_%28lettre_grecque%29> .
<http://el.dbpedia.org/resource/Ψι> <http://www.w3.org/2002/07/owl#sameAs> <http://ru.dbpedia.org/resource/Пси_(буква)> .

Inter language links:
en and de just link to each other and to el. (Minor: interlanguage_links_en.n3.bz2 seems to have 3082 triples of this form: ?s owl:sameAs ?s)
All other languages seem to have owl:sameAs links to all languages.

Prefixing dbpedia.org with language codes:
The de, el and ru files contain URIs like de.dbpedia.org, el.dbpedia.org, ru.dbpedia.org.
fr, es, nl, ... (all others I tried) lack the prefixes in all data files I tried, but interlanguage_links_en.n3 shows them. As a result, all interlanguage links pointing to http://fr.dbpedia.org/... point to nothing, as the data is at http://dbpedia.org/...

Example:
from fr/labels_fr.nt.bz2:
<http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> <http://www.w3.org/2000/01/rdf-schema#label> "Alg\u00E8bre g\u00E9n\u00E9rale"@fr .

from interlang-i18n/fr/interlanguage_links_fr.nt.bz2:
<http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> <http://www.w3.org/2002/07/owl#sameAs> <http://de.dbpedia.org/resource/Abstrakte_Algebra> .
<http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Abstract_algebra> .

I'd happily lend a hand to help fix these issues in 3.8.
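
For anyone who needs to load the mixed files in the meantime, here is a rough Python sketch of a loading-side workaround. It rewrites the UTF-8 IRIs and unescaped parentheses into the %-encoded form the other languages use, and it flags the ?s owl:sameAs ?s triples. This is not what the extraction framework does: the function names and the set of characters left unescaped (SAFE) are assumptions made for illustration, and the line handling is far simpler than a real N-Triples parser. It does not touch the language-prefix mismatch.

```python
import re
from urllib.parse import quote

# An IRI between angle brackets, as written in the dump lines.
IRI_RE = re.compile(r"<([^<>\s]*)>")

# Characters kept as-is (an assumption chosen for this sketch). "%" is kept
# so that URIs which are already %-escaped are not escaped a second time.
SAFE = "/:#%?=&@!$*+,;"

SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"


def escape_iri(iri):
    """%-encode non-ASCII characters (as UTF-8 bytes) and parentheses."""
    return quote(iri, safe=SAFE)


def escape_line(line):
    """Escape the IRIs of one triple line, leaving any literal untouched."""
    # Everything from the first double quote onwards is treated as literal.
    head, dquote, literal = line.partition('"')
    head = IRI_RE.sub(lambda m: "<" + escape_iri(m.group(1)) + ">", head)
    return head + dquote + literal


def is_self_link(line):
    """True for triples of the form ?s owl:sameAs ?s."""
    parts = line.split()
    return len(parts) >= 3 and parts[1] == SAME_AS and parts[0] == parts[2]


line = ("<http://el.dbpedia.org/resource/Ταυ> "
        "<http://www.w3.org/2002/07/owl#sameAs> "
        "<http://de.dbpedia.org/resource/Tau_(Buchstabe)> .")
print(escape_line(line))
```

Keeping "%" in the safe set means lines that are already escaped (such as the nl and fr ones above) pass through unchanged, so the same function can be run over every file regardless of which convention it uses.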

Cheers,
Jörn

PS: yes, I just wanted to update http://joernhees.de/blog/2010/10/31/setting-up-a-local-dbpedia-mirror-with-virtuoso/

_______________________________________________
Dbpedia-discussion mailing list
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
