[
https://issues.apache.org/jira/browse/STANBOL-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608532#comment-13608532
]
Christopher Sahnwaldt commented on STANBOL-804:
-----------------------------------------------
Hi,
I prepared the DBpedia 3.8 release and believe (hope!) I fixed the encoding
problems that we had.
There is a thread on the dbpedia-discussion list [1] about this issue. At the
moment, it looks like the problem may be compression rather than encoding. How
do Stanbol / Jena uncompress bz2 files? Older versions of Commons Compress
couldn't handle concatenated bz2 streams (see COMPRESS-162 and COMPRESS-146).
Since 3.8, DBpedia compresses files with pbzip2 [2]. It's much faster on
multi-core machines, but it produces concatenated bz2 streams.
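To illustrate the hypothesis above, here is a minimal Python sketch. It is not Stanbol or Jena code, and the example.org triples and all variable names are made up. It assumes that a parallel compressor cuts the input at byte offsets rather than at line ends, so a decoder that stops after the first stream would hand the parser a line that breaks off mid-token, much like the "Broken token" errors in the report below.

```python
import bz2

# Two made-up N-Triples lines (example.org is a placeholder namespace).
line1 = b"<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
line2 = b"<http://example.org/c> <http://example.org/p> <http://example.org/d> .\n"
payload = line1 + line2

# Assumption: the input is split at a byte offset, not at a line end.
# Each chunk becomes its own bz2 stream and the streams are concatenated.
cut = len(line1) + 30
stream1 = bz2.compress(payload[:cut])
stream2 = bz2.compress(payload[cut:])
concatenated = stream1 + stream2

# A single-stream decoder stops at the end of the first stream.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)
# The second stream is not decoded; it is left in single.unused_data.

# bz2.decompress() (Python 3.3+) continues across stream boundaries.
everything = bz2.decompress(concatenated)

print(first_only.decode("ascii"))   # ends in the middle of the second line
print(everything == payload)
```

If the indexing tool reads the dumps with a decoder that behaves like `single` here, the last line it sees would be cut off even though the file itself is intact.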
Cheers,
JC
[1] http://sourceforge.net/mailarchive/message.php?msg_id=30623819
[2] http://compression.ca/pbzip2/
> Creating a spanish Index
> ------------------------
>
> Key: STANBOL-804
> URL: https://issues.apache.org/jira/browse/STANBOL-804
> Project: Stanbol
> Issue Type: Question
> Components: Entityhub
> Reporter: Juan Vargas
> Priority: Minor
> Fix For: entityhub-0.11.0
>
>
> Hello.
> I'm Juan Vargas, a web developer at Notedlinks S.L. in Spain.
> For a few days I have been trying to create a Spanish index from the DBpedia
> 3.8 files, for use with the Stanbol Enhancer, following the instructions at
> https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/dbpedia/README.md
> These are the steps I took:
> 1. Build the indexing tool
> - cd {stanbol-source}/entityhub/indexing/genericrdf/ (inside the Stanbol
> source tree; this requires building Stanbol first, see
> http://stanbol.apache.org/docs/trunk/tutorial.html)
> - mvn assembly:single
> - move
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar to
> the target directory where I plan to build the index
> 2. Create the sub-folders in the target directory
> - java -jar
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
> 3. Download the DBpedia dump files and copy them to 'indexing/resources/rdfdata':
> http://downloads.dbpedia.org/3.8/dbpedia_3.6.owl.bz2 (general for any
> language)
> http://downloads.dbpedia.org/3.8/es/instance_types_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/labels_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/short_abstracts_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/long_abstracts_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/geo_coordinates_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/persondata_es.nt.bz2 (does not seem to
> exist for Spanish; is it a problem if it is not used?)
> http://downloads.dbpedia.org/3.8/es/article_categories_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/category_labels_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/skos_categories_es.nt.bz2
> http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
> 4. Generate the entity scores and copy the result to 'indexing/resources':
> - curl http://downloads.dbpedia.org/3.8/es/page_links_es.nt.bz2 | bzcat |
> sed -e 's/.*<http\:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' | sort \
> | uniq -c | sort -nr > incoming_links.txt (changes for Spanish: the resource
> URL, and 'es' in place of 'en'; see the notes in the README)
> 5. Configure the index:
> - I left the defaults, because I do not really understand how to configure it.
> 6. Run the jar to create the index:
> - java -jar
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
> index
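For readers who want to check what step 4 computes, the pipeline can be approximated in Python. This is only a sketch: the function names and the sample triples are hypothetical, and unlike the sed command it skips lines that do not match instead of passing them through unchanged. The URI prefix is the one from the quoted command.

```python
import re
from collections import Counter

# Captures the local name of the object URI at the end of an N-Triples line.
TARGET = re.compile(r"<http://es\.dbpedia\.org/resource/([^>]*)> \.\s*$")

def count_incoming_links(lines):
    """Count how often each resource appears as the target of a link."""
    counts = Counter()
    for line in lines:
        match = TARGET.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def format_counts(counts):
    """Render 'count name' rows, highest count first (like sort -nr)."""
    return [f"{n} {name}" for name, n in counts.most_common()]

# Made-up sample data; the predicate URI is a placeholder.
sample = [
    "<http://es.dbpedia.org/resource/A> <http://example.org/link> <http://es.dbpedia.org/resource/Madrid> .\n",
    "<http://es.dbpedia.org/resource/B> <http://example.org/link> <http://es.dbpedia.org/resource/Madrid> .\n",
    "<http://es.dbpedia.org/resource/A> <http://example.org/link> <http://es.dbpedia.org/resource/Sevilla> .\n",
]
ranked = format_counts(count_incoming_links(sample))
print("\n".join(ranked))
```

To run it on a real dump, the lines could come from `bz2.open(path, "rt", encoding="utf-8")`, which in Python 3.3 and later reads across concatenated bz2 streams.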
> The execution crashes with the following trace:
> 10:42:36,037 [Thread-3] ERROR source.ResourceLoader - Unable to load resource
> /home/juan/stanbol-index/indexing/resources/rdfdata/redirects_es.nt.bz2
> org.openjena.riot.RiotException: [line: 5854, col: 103] Broken token:
> http://es.dbpedia.org/resource/Pactos_de_
> at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
> at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
> at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
> at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
> at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
> at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
> at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
> at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
> at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
> at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
> at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
> at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
> at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
> at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
> at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
> at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
> at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
> at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
> at java.lang.Thread.run(Thread.java:679)
> Looking at the redirects_es.nt.bz2 file:
> 5852 <http://es.dbpedia.org/resource/Tratados_Lateranos>
> <http://dbpedia.org/ontology/wikiPageRedirects>
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> 5853 <http://es.dbpedia.org/resource/Tratado_Laterano>
> <http://dbpedia.org/ontology/wikiPageRedirects>
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> 5854 <http://es.dbpedia.org/resource/Tratado_Lateranense>
> <http://dbpedia.org/ontology/wikiPageRedirects>
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> 5855 <http://es.dbpedia.org/resource/Tratados_Lateranenses>
> <http://dbpedia.org/ontology/wikiPageRedirects>
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> I don't see any error. Could someone help me and tell me whether there is
> anything unusual here?
> I also tried to build an English DBpedia 3.8 index, to check whether I was
> doing something wrong in the Spanish version. It seemed fine at first, but a
> few minutes later I got:
> 11:23:32,576 [Thread-3] ERROR source.ResourceLoader - Unable to load resource
> /home/juan/stanbol-index/indexing/resources/rdfdata/short_abstracts_en.nt.bz2
> org.openjena.riot.RiotException: [line: 1880, col: 96] Broken token: Bambara,
> also known as Bamana, and Bamanankan by speakers of the language, is a
> language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as
> many as six million people (in
> at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
> at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
> at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
> at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
> at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
> at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
> at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
> at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
> at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
> at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
> at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
> at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
> at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
> at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
> at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
> at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
> at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
> at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
> at java.lang.Thread.run(Thread.java:679)
> Looking at short_abstracts_en.nt.bz2:
> 1879 <http://dbpedia.org/resource/Bernard_of_Clairvaux>
> <http://www.w3.org/2000/01/rdf-schema#comment> "Bernard of Clairvaux, O. Cist
> (1090 \u2013 August 20, 1153) was a French abbot and the primary builder of
> the reforming Cistercian order. After the death of his mother, Bernard sought
> admission into the Cistercian order. Three years later, he was sent to found
> a new abbey at an isolated clearing in a glen known as the Val d'Absinthe,
> about 15\u00A0km southeast of Bar-sur-Aube. According to tradition, Bernard
> founded the monastery on 25 June 1115, naming it Claire Vall\u00E9e, which
> evolved into Clairvaux."@en .
> 1880 <http://dbpedia.org/resource/Bambara_language>
> <http://www.w3.org/2000/01/rdf-schema#comment> "Bambara, also known as
> Bamana, and Bamanankan by speakers of the language, is a language spoken in
> Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million
> people (including second language users). The Bambara language is the
> language of people of the Bambara ethnic group, numbering about 4,000,000
> people, but serves also as a lingua franca in Mali (it is estimated that
> about 80% of the population speak it as a first or second language)."@en .
> 1881 <http://dbpedia.org/resource/Bishkek>
> <http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek and
> Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is also
> the administrative centre of Chuy Province which surrounds the city, even
> though the city itself is not part of the province but rather a
> province-level unit of Kyrgyzstan. The name is thought to derive from a
> Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz
> national drink."@en .
> Could someone explain why errors like "Broken token" appear, or whether I'm
> doing something wrong? I think I followed the guide correctly. Thanks, and I
> hope this information helps others who try to create indexes for Apache
> Stanbol, which is a really great project. Nice work!
> Best,
> Juan.
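If the concatenated-stream explanation from the comment at the top turns out to be right, one possible workaround is to rewrite each dump as a single bz2 stream before indexing. The following Python sketch shows the idea. Both function names are hypothetical, nothing here comes from Stanbol or DBpedia tooling, and it has not been tried against the real dumps.

```python
import bz2
import shutil

def recompress_single_stream(src_path, dst_path, chunk_size=1 << 20):
    """Rewrite a (possibly multi-stream) bz2 file as one bz2 stream."""
    # bz2.open() reads across stream boundaries (Python 3.3+); writing
    # through a single file object produces exactly one stream.
    with bz2.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

def is_single_stream(path, chunk_size=1 << 20):
    """Return True if the file holds one complete bz2 stream and nothing else."""
    decoder = bz2.BZ2Decompressor()
    with open(path, "rb") as f:
        while not decoder.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # file ended before the stream was complete
            decoder.decompress(chunk)
        # Anything after the first stream means the file is concatenated.
        return not decoder.unused_data and not f.read(1)
```

A call such as `recompress_single_stream("redirects_es.nt.bz2", "redirects_es.single.nt.bz2")` (the output name is only an example) would then give a file that a single-stream decoder can read to the end.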