Hi Juan,

the RDF dump files from DBpedia do contain invalid UTF-8 characters. With DBpedia version 3.7 this affected only very few files; in version 3.8 many more files are affected. Because of that I recently created a shell script that corrects such errors for all files.
See http://markmail.org/message/67ivlyoxfqad6xoe for details. What this basically does is execute the following commands on all files:

    bzcat ${filename}.bz2 \
      | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
      | gzip -c > ${filename}.gz
    rm -f ${filename}.bz2

So you can also run this manually for all files in your rdfdata folder.

best
Rupert

On Tue, Nov 13, 2012 at 1:20 PM, Juan Vargas <[email protected]> wrote:
> Hello.
>
> I'm Juan Vargas, a web developer at Notedlinks S.L. from Spain. (Issue:
> https://issues.apache.org/jira/browse/STANBOL-804)
>
> I've been trying for a few days to create a Spanish index using DBpedia 3.8
> files, following the instructions at
> https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/dbpedia/README.md
> for use with the Stanbol enhancer, that is:
>
> *1. Building index tool*
> - cd {stanbol-source}/entityhub/indexing/genericrdf/ (where you install stanbol)
>   * require stanbol (http://stanbol.apache.org/docs/trunk/tutorial.html)
> - mvn assembly:single
> - move org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
>   to the target directory where I plan to build the index
>
> *2. Create sub-folder on target directory*
> - java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
>
> *3. Download dbpedia dump files and copy in* 'indexing/resources/rdfdata':
>
> - http://downloads.dbpedia.org/3.8/dbpedia_3.6.owl.bz2 (general for any language)
> - http://downloads.dbpedia.org/3.8/en/instance_types_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/labels_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/short_abstracts_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/long_abstracts_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/geo_coordinates_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/persondata_es.nt.bz2 (doesn't seem
>   to exist in Spanish; is it a problem if it isn't used?)
> - http://downloads.dbpedia.org/3.8/es/article_categories_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/category_labels_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/es/skos_categories_es.nt.bz2
> - http://downloads.dbpedia.org/3.8/en/redirects_es.nt.bz2
>
> *4. Generate entities score and copy to* 'indexing/resources':
> - curl http://downloads.dbpedia.org/3.8/es/page_links_en.nt.bz2 | bzcat |
>   sed -e 's/.*<http\:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' | sort \
>   | uniq -c | sort -nr > incoming_links.txt
>
> (changes for Spanish: resource URL, 'en' for 'es'; see the suggested notes
> on the web page)
>
> *5. Configuration of the index:*
> - I left the defaults, since I don't really understand how to configure it.
>
> *6. Execute jar to create index:*
> - java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
>
> The execution crashes, and the trace is as follows:
>
> 10:42:36,037 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/redirects_es.nt.bz2
> org.openjena.riot.RiotException: [line: *5854*, col: 103] *Broken token*: http://es.dbpedia.org/resource/Pactos_de_
>     at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
>     at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:679)
>
> Looking at the redirects_es.nt.bz2 file:
>
> 5852 <http://es.dbpedia.org/resource/Tratados_Lateranos> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> 5853 <http://es.dbpedia.org/resource/Tratado_Laterano> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> *5854* <http://es.dbpedia.org/resource/Tratado_Lateranense> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> 5855 <http://es.dbpedia.org/resource/Tratados_Lateranenses> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>
> I don't see any error. Could someone help me, if there is anything unusual?
>
> Also, I tried to build a DBpedia 3.8 English version, to check whether I was
> doing something wrong with the Spanish version. It seemed OK, but a few
> minutes later I got:
>
> 11:23:32,576 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/short_abstracts_en.nt.bz2
> org.openjena.riot.RiotException: [line: *1880*, col: 96] *Broken token*: Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (in
>     at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
>     at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:679)
>
> Looking at short_abstracts_en.nt.bz2:
>
> 1879 <http://dbpedia.org/resource/Bernard_of_Clairvaux> <http://www.w3.org/2000/01/rdf-schema#comment> "Bernard of Clairvaux, O. Cist (1090 \u2013 August 20, 1153) was a French abbot and the primary builder of the reforming Cistercian order. After the death of his mother, Bernard sought admission into the Cistercian order. Three years later, he was sent to found a new abbey at an isolated clearing in a glen known as the Val d'Absinthe, about 15\u00A0km southeast of Bar-sur-Aube. According to tradition, Bernard founded the monastery on 25 June 1115, naming it Claire Vall\u00E9e, which evolved into Clairvaux."@en .
> *1880* <http://dbpedia.org/resource/Bambara_language> <http://www.w3.org/2000/01/rdf-schema#comment> "Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (including second language users). The Bambara language is the language of people of the Bambara ethnic group, numbering about 4,000,000 people, but serves also as a lingua franca in Mali (it is estimated that about 80% of the population speak it as a first or second language)."@en .
> 1881 <http://dbpedia.org/resource/Bishkek> <http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek and Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is also the administrative centre of Chuy Province which surrounds the city, even though the city itself is not part of the province but rather a province-level unit of Kyrgyzstan. The name is thought to derive from a Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz national drink."@en .
>
> Could someone say why errors like "Broken token" appear, or whether I'm
> doing something wrong? I think that I followed the guide well. Thanks, and I
> hope that this information can help others who try to create indexes for
> Apache Stanbol, which is a really great project. Nice work!
>
> Best,
> Juan.

--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
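
For running the command from the reply over a whole folder, a minimal sketch might look like the following. The function name `fix_rdf_dumps` and the default path are assumptions made for illustration; they are not taken from the script linked above. Unlike the original two commands, this sketch only deletes the .bz2 file when the last stage of the pipeline (gzip) succeeds.

```shell
# Hypothetical helper: applies the sed fix from the reply to every
# *.bz2 file in a folder and re-compresses the result as *.gz.
fix_rdf_dumps() {
    # Folder holding the dumps; this default is an assumption.
    dir="${1:-indexing/resources/rdfdata}"
    for bz in "$dir"/*.bz2; do
        # Skip the unexpanded pattern when no .bz2 file is present.
        [ -e "$bz" ] || continue
        base="${bz%.bz2}"
        # Same sed expression as in the reply: escape "\\" and any
        # backslash that is not followed by 'u' or '"'.
        # Note: without pipefail, only gzip's exit status is checked.
        bzcat "$bz" \
            | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
            | gzip -c > "$base.gz" \
            && rm -f "$bz"
    done
}
```

It could then be called as `fix_rdf_dumps indexing/resources/rdfdata` from the indexing working directory.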
