[jira] [Commented] (STANBOL-804) Creating a spanish Index

Christopher Sahnwaldt (JIRA) Wed, 20 Mar 2013 19:01:18 -0700

    [ https://issues.apache.org/jira/browse/STANBOL-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608532#comment-13608532 ]


Christopher Sahnwaldt commented on STANBOL-804:
-----------------------------------------------

Hi,

I prepared the DBpedia 3.8 release and believe (hope!) I fixed the encoding 
problems that we had.

There is a thread on the dbpedia-discussion list [1] about this issue. At the 
moment, it looks like the problem may be compression rather than encoding. How 
do Stanbol / Jena decompress bz2 files? Older versions of Commons Compress 
couldn't handle concatenated bz2 streams (see COMPRESS-162 and COMPRESS-146). 
Since 3.8, DBpedia compresses files with pbzip2 [2]. It's much faster on 
multi-core machines, but it produces concatenated bz2 streams.

Cheers,
JC

[1] http://sourceforge.net/mailarchive/message.php?msg_id=30623819
[2] http://compression.ca/pbzip2/
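
The concatenated-stream behaviour described in the comment above can be reproduced with any bzip2 library. The following is a minimal sketch in Python, not the Commons Compress or Jena code path; Python is used only because its standard bz2 module exposes both behaviours. A decoder that stops after the first stream returns text cut off mid-line, which has the same shape as the "Broken token" errors in the quoted report. The function names and the sample triple are illustrative assumptions.

```python
import bz2


def decompress_first_stream_only(data: bytes) -> bytes:
    """Decode only the first bzip2 stream and silently ignore the rest.

    This mimics a reader that does not support concatenated streams.
    """
    decoder = bz2.BZ2Decompressor()
    return decoder.decompress(data)


def decompress_all_streams(data: bytes) -> bytes:
    """Decode every bzip2 stream in `data`, one after another."""
    chunks = []
    while data:
        decoder = bz2.BZ2Decompressor()
        chunks.append(decoder.decompress(data))
        if not decoder.eof:
            raise ValueError("input ends in the middle of a bzip2 stream")
        # Bytes after the end of this stream belong to the next stream.
        data = decoder.unused_data
    return b"".join(chunks)


if __name__ == "__main__":
    # Hypothetical triple, split mid-token the way a parallel compressor
    # may split its input at an arbitrary block boundary.
    head = b"<http://example.org/s> <http://example.org/p> <http://example.org/Pactos_de_"
    tail = b"Letr\\u00E1n> .\n"
    concatenated = bz2.compress(head) + bz2.compress(tail)

    print(decompress_first_stream_only(concatenated))
    print(decompress_all_streams(concatenated))
```

On the Java side, newer Commons Compress releases offer a BZip2CompressorInputStream(InputStream, boolean decompressConcatenated) constructor. Whether the Stanbol importer uses it is not stated in this thread.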
                
> Creating a spanish Index
> ------------------------
>
>                 Key: STANBOL-804
>                 URL: https://issues.apache.org/jira/browse/STANBOL-804
>             Project: Stanbol
>          Issue Type: Question
>          Components: Entityhub
>            Reporter: Juan Vargas
>            Priority: Minor
>             Fix For: entityhub-0.11.0
>
>
> Hello.
> I'm Juan Vargas, a web developer at Notedlinks S.L. in Spain.
> I've spent a few days trying to create a Spanish index from the DBpedia 3.8 
> files for use with the Stanbol Enhancer, following the instructions at 
> https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/dbpedia/README.md
> That means:
> 1. Build the indexing tool
>    - cd {stanbol-source}/entityhub/indexing/genericrdf/  ({stanbol-source} is 
> where Stanbol is installed; this requires Stanbol, see 
> http://stanbol.apache.org/docs/trunk/tutorial.html)
>    - mvn assembly:single
>    - move org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar 
> to the target directory where I plan to build the index
> 2. Create the sub-folders in the target directory
>    - java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
> 3. Download the DBpedia dump files and copy them into 'indexing/resources/rdfdata':
>     http://downloads.dbpedia.org/3.8/dbpedia_3.6.owl.bz2    (general, for any language)
>     http://downloads.dbpedia.org/3.8/es/instance_types_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/labels_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/short_abstracts_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/long_abstracts_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/geo_coordinates_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/persondata_es.nt.bz2  (doesn't seem to 
> exist in Spanish; is it a problem if it isn't used?)
>     http://downloads.dbpedia.org/3.8/es/article_categories_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/category_labels_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/skos_categories_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
> 4. Generate the entity scores and copy them to 'indexing/resources':
>   - curl http://downloads.dbpedia.org/3.8/es/page_links_en.nt.bz2 | bzcat | sed -e 's/.*<http\:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' | sort | uniq -c | sort -nr > incoming_links.txt
>     (changes for Spanish: the resource URL, and 'en' replaced by 'es'; see 
> the suggested notes in the README)
> 5. Configure the index:
>  - I left the defaults, since I don't really understand how to configure it.
> 6. Run the jar to create the index:
>   - java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
> The execution crashes, and the trace is as follows:
> 10:42:36,037 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/redirects_es.nt.bz2
> org.openjena.riot.RiotException: [line: 5854, col: 103] Broken token: http://es.dbpedia.org/resource/Pactos_de_
>     at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
>     at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:679)
> Looking at the redirects_es.nt.bz2 file:
>   5852 <http://es.dbpedia.org/resource/Tratados_Lateranos> 
> <http://dbpedia.org/ontology/wikiPageRedirects> 
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>    5853 <http://es.dbpedia.org/resource/Tratado_Laterano> 
> <http://dbpedia.org/ontology/wikiPageRedirects> 
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>    5854 <http://es.dbpedia.org/resource/Tratado_Lateranense> 
> <http://dbpedia.org/ontology/wikiPageRedirects> 
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>    5855 <http://es.dbpedia.org/resource/Tratados_Lateranenses> 
> <http://dbpedia.org/ontology/wikiPageRedirects> 
> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> I don't see any error. Could someone help me spot anything unusual?
> I also tried building an English DBpedia 3.8 index, to check whether I was 
> doing something wrong with the Spanish version. It seemed fine at first, but 
> a few minutes later I got:
> 11:23:32,576 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/short_abstracts_en.nt.bz2
> org.openjena.riot.RiotException: [line: 1880, col: 96] Broken token: Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (in
>     at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
>     at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:679)
> Looking at short_abstracts_en.nt.bz2:
> 1879 <http://dbpedia.org/resource/Bernard_of_Clairvaux> 
> <http://www.w3.org/2000/01/rdf-schema#comment> "Bernard of Clairvaux, O. Cist 
> (1090 \u2013 August 20, 1153) was a French abbot and the primary builder of 
> the reforming Cistercian order. After the death of his mother, Bernard sought 
> admission into the Cistercian order. Three years later, he was sent to found 
> a new abbey at an isolated clearing in a glen known as the Val d'Absinthe, 
> about 15\u00A0km southeast of Bar-sur-Aube. According to tradition, Bernard 
> founded the monastery on 25 June 1115, naming it Claire Vall\u00E9e, which 
> evolved into Clairvaux."@en .
>    1880 <http://dbpedia.org/resource/Bambara_language> 
> <http://www.w3.org/2000/01/rdf-schema#comment> "Bambara, also known as 
> Bamana, and Bamanankan by speakers of the language, is a language spoken in 
> Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million 
> people (including second language users). The Bambara language is the 
> language of people of the Bambara ethnic group, numbering about 4,000,000 
> people, but serves also as a lingua franca in Mali (it is estimated that 
> about 80% of the population speak it as a first or second language)."@en .
>    1881 <http://dbpedia.org/resource/Bishkek> 
> <http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek and 
> Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is also 
> the administrative centre of Chuy Province which surrounds the city, even 
> though the city itself is not part of the province but rather a 
> province-level unit of Kyrgyzstan. The name is thought to derive from a 
> Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz 
> national drink."@en .
> Could someone explain why errors like "Broken token" appear, or whether I'm 
> doing something wrong? I think I followed the guide correctly. Thanks, and I 
> hope this information helps others who try to create indexes for Apache 
> Stanbol, which is a really great project. Nice work!
> Best,
> Juan.
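
Step 4 of the quoted procedure scores each entity by its number of incoming page links. A minimal Python sketch of the same counting follows, assuming N-Triples input and the Spanish resource prefix taken from the sed expression. The function name count_incoming_links is hypothetical and not part of Stanbol. Unlike the sed pipeline, lines that do not match are skipped rather than passed through.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches the object of a triple: the last <.../resource/NAME> before " ."
_TARGET = re.compile(r".*<http://es\.dbpedia\.org/resource/([^>]*)> \.\s*$")


def count_incoming_links(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count how often each resource appears as a link target.

    Returns (name, count) pairs, highest count first; ties are ordered
    by name so the result is deterministic.
    """
    counts: Counter = Counter()
    for line in lines:
        match = _TARGET.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample triples; the real input is the page links dump.
    link = "<http://dbpedia.org/ontology/wikiPageWikiLink>"
    base = "http://es.dbpedia.org/resource/"
    sample = [
        f"<{base}A> {link} <{base}B> .\n",
        f"<{base}C> {link} <{base}B> .\n",
        f"<{base}B> {link} <{base}A> .\n",
    ]
    for name, count in count_incoming_links(sample):
        print(count, name)
```

To feed it a real dump, the lines can be streamed with bz2.open(path, "rt", encoding="utf-8"), which in Python 3.3 and later reads all streams of a multi-stream bz2 file.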

