Hi Jörn, most of these problems should be fixed in the current development version of the code.
1. Text encoding

Starting with 3.8, we will be able to generate several file formats: mainly N-Triples, N-Quads and Turtle, but also TriX.

The N-Triples and N-Quads files will be ASCII only, as required by the spec. Of course, that means they are hardly human-readable for non-Latin languages.

The Turtle files use no special Turtle features like @prefix; they are basically N-Triples files that use UTF-8 encoding. They are human-readable for all languages, because only some special ASCII characters have to be escaped in Turtle.

We also produce turtle-quads files. I don't think there is a formal specification for that format - it's basically N-Quads, but with Turtle rules for \u escaping, i.e. using UTF-8 instead of most \u escapes.

If you still find any encoding problems, please let us know. The Turtle rules may not be implemented correctly for higher Unicode planes.

2. URI escaping

Also starting with 3.8, there will be a very simple configuration switch to choose between IRIs (only a few special ASCII chars are percent-escaped) and URIs (special ASCII chars and all non-ASCII chars are percent-escaped). Both can be written to different files during one extraction run. We will probably have to choose 'canonical' DBpedia URIs/IRIs though, so we may not publish both versions.

In both cases, round brackets "()" will not be escaped because the RFCs for URIs and IRIs do not mandate it. I hope this won't cause too many backwards-compatibility problems. If it does, it's trivial to change this behavior since we now have our own configurable URI encoder.

Again, the problems should be fixed, but especially for higher Unicode planes, IRI escaping may not be quite correct.

We're not sure yet which of these files we will publish. There are twelve different format combinations (quads/triples, URIs/IRIs, NT/Turtle, etc.), and it doesn't make sense to publish them all. In the past, we only published NT and NQ files, but I would like to also offer Turtle.

3. Inter language links and 4. DBpedia language domains are fodder for a separate mail...

So much for now,
Christopher

On Mon, May 7, 2012 at 12:19 PM, Jörn Hees <[email protected]> wrote:
> Hi,
>
> i'm currently struggling with the DBpedia 3.7 dumps, but as the DBpedia 3.8
> seems to be on the road i thought i'd let you know about some of the problems
> i encountered by now which make it tricky to work with the dumps.
>
> I downloaded the 3.7 all_languages.tar and the all_languages-i18n.tar and the
> interlang-i18n.tar .
>
> The problems i encountered are mainly related to the i18n dataset, but can
> also be found in the interlanguage links of the "traditional" dump (in the
> all_languages.tar):
>
> It seems as if the de, el and ru wikipedia were exported in a different way
> from other languages, as their encodings are different and they have the
> language prefix in the URIs (de.dbpedia.org). They use UTF-8 IRIs in the dump
> files, while all other languages i tried use % escaped URIs and don't have a
> language prefixed URI.
>
> This leads to a couple of encoding issues (.nt files are ASCII only and
> normally use % encoding, but de, el and ru contain UTF-8 here) and
> inconsistencies (e.g., interlanguage links to fr.dbpedia.org pointing into
> nothing).
> Also the interlanguage link files show different ways of encoding within the
> same file, making them tricky to load.
>
>
> Details where you can see these issue: (the following is all related to the
> 3.7-i18n dumps!)
> =========================================================================
>
> File Encoding:
> Mainly the de, el and ru, as well as the interlanguage_link dumps contain .nt
> files with UTF-8 encoding (for IRIs). This isn't valid .nt, so maybe consider
> renaming all those files to .n3?
> Example:
> from de/labels_de.nt.bz2:
> <http://de.dbpedia.org/resource/Anschlussfähigkeit>
> <http://www.w3.org/2000/01/rdf-schema#label> "Anschlussf\u00E4higkeit"@de .
> from fr/lables_fr.nt.bz2:
> <http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale>
> <http://www.w3.org/2000/01/rdf-schema#label> "Alg\u00E8bre
> g\u00E9n\u00E9rale"@fr .
>
> Encoding & escaping:
> The de, el and ru IRIs seem to be UTF-8 encoded and () seem to be "unescaped".
> All other languages I tried seem to use % encoded URIs and escape the
> brackets with %28 / %29.
> Example:
> from interlang-i18n/el/interlanguage_links_el.nt.bz2
> <http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs>
> <http://de.dbpedia.org/resource/Tau_(Buchstabe)> .
> <http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs>
> <http://dbpedia.org/resource/Tau> .
> <http://el.dbpedia.org/resource/Ταυ> <http://www.w3.org/2002/07/owl#sameAs>
> <http://nl.dbpedia.org/resource/Tau_%28letter%29> .
> <http://el.dbpedia.org/resource/Ψι> <http://www.w3.org/2002/07/owl#sameAs>
> <http://fr.dbpedia.org/resource/Psi_%28lettre_grecque%29> .
> <http://el.dbpedia.org/resource/Ψι> <http://www.w3.org/2002/07/owl#sameAs>
> <http://ru.dbpedia.org/resource/Пси_(буква)> .
>
> Inter language links:
> en and de just link to each other and el.
> ( minor: interlanguage_links_en.n3.bz2 seems to have 3082 triples of this
> form: ?s owl:sameAs ?s )
> All other languages seem to have owl:sameAs to all languages.
>
> Prefixing dbpedia.org with language codes:
> de, el and ru files contain URIs like de.dbpedia.org, el.dbpedia.org,
> ru.dbpedia.org.
> fr, es, nl, ... (all others i tried) lack the prefixes in all data files i
> tried, but the interlanguage_links_en.n3 show them.
> This leads to all interlanguage links pointing to http://fr.dbpedia.org/...
> point to nothing, as the data is at http://dbpedia.org/...
> Example:
> from fr/lables_fr.nt.bz2:
> <http://dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale>
> <http://www.w3.org/2000/01/rdf-schema#label> "Alg\u00E8bre
> g\u00E9n\u00E9rale"@fr .
> from interlang-i18n/fr/interlanguage_links_fr.nt.bz2:
> <http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale>
> <http://www.w3.org/2002/07/owl#sameAs>
> <http://de.dbpedia.org/resource/Abstrakte_Algebra> .
> <http://fr.dbpedia.org/resource/Alg%C3%A8bre_g%C3%A9n%C3%A9rale>
> <http://www.w3.org/2002/07/owl#sameAs>
> <http://dbpedia.org/resource/Abstract_algebra> .
>
> I'd happily offer a hand to help fixing these issues in 3.8.
>
> Cheers,
> Jörn
>
> PS: yes, i just wanted to update
> http://joernhees.de/blog/2010/10/31/setting-up-a-local-dbpedia-mirror-with-virtuoso/
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
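
For anyone loading the dumps, the two escaping schemes from points 1 and 2 of the reply can be sketched roughly as below. This is only an illustration in Python, not the extraction framework's own encoder: the function names and the SAFE set are made up for the example, and the reply does not say exactly which ASCII characters count as "special", so that part is an assumption.

```python
from urllib.parse import quote

# Assumption: besides letters, digits and "_.-~" (never escaped by quote()),
# only round brackets are left alone; every other ASCII char is percent-escaped.
SAFE = "()"


def to_uri(name: str) -> str:
    """URI mode: percent-escape special ASCII and all non-ASCII (as UTF-8)."""
    return quote(name, safe=SAFE)


def to_iri(name: str) -> str:
    """IRI mode: percent-escape special ASCII only; keep non-ASCII as-is."""
    return "".join(ch if ord(ch) > 0x7F else quote(ch, safe=SAFE) for ch in name)


def escape_literal(text: str, ascii_only: bool) -> str:
    """ascii_only=True: N-Triples style \\u escapes; False: Turtle style UTF-8."""
    simple = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in text:  # Python iterates code points, so higher planes stay whole
        cp = ord(ch)
        if ch in simple:
            out.append(simple[ch])
        elif cp < 0x20 or cp == 0x7F or (ascii_only and cp > 0x7F):
            # four hex digits for the BMP, eight for higher Unicode planes
            out.append("\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_uri("Algèbre_générale"))                  # Alg%C3%A8bre_g%C3%A9n%C3%A9rale
    print(to_iri("Пси_(буква)"))                       # Пси_(буква)
    print(escape_literal("Anschlussfähigkeit", True))  # Anschlussf\u00E4higkeit
```

The outputs match the forms quoted from the 3.7 dumps above, with one difference: brackets stay unescaped in both modes, as described for 3.8.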
