[Dbpedia-discussion] Re: extraction problem

riko adi prasetya Mon, 04 Mar 2013 22:08:01 -0800

Hi Gaurav,

Be patient.
Extracting the Indonesian data dump took me 4 hours.
I think it depends on the host specs and the size of the data dump.

Yes, the extracted triples are written to the same source directory.
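
A quick way to confirm where the output landed is to list the dated dump directory. This is only a minimal sketch: it assumes the output files sit next to the source dump in <base-dir>/<wiki>/<date>/ and end in the suffixes named by the format.* keys (.nt and .ttl in the config quoted below). list_outputs is a hypothetical helper, not part of the framework.

```python
from pathlib import Path


def list_outputs(base_dir, wiki, date, suffixes=(".nt", ".ttl")):
    """Return the sorted names of files in <base_dir>/<wiki>/<date>/
    whose names end with one of the given suffixes."""
    dated = Path(base_dir) / wiki / date
    if not dated.is_dir():
        return []
    return sorted(
        p.name
        for p in dated.iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )


if __name__ == "__main__":
    # Path, wiki and date are taken from the log output in this thread.
    base = "/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump"
    for name in list_outputs(base, "dewiki", "20130219"):
        print(name)
```

An empty list while the extraction is still running does not necessarily mean failure; the files may only appear once the extractor starts writing.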


Cheers,
Riko


________________________________
 From: gaurav pant <[email protected]>
To: riko adi prasetya <[email protected]> 
Sent: Tuesday, 5 March 2013 12:38
Subject: Re: [Dbpedia-discussion] extraction problem
 

Hi Riko,

Thanks for your reply. I have tried that change. It is running now, but it has 
been waiting for a long time at 

"
Mar 05, 2013 5:13:30 AM org.dbpedia.extraction.mappings.Redirects$ loadFromCache
INFO: Loading redirects from cache file /mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump/dewiki/20130219/dewiki-20130219-template-redirects.obj
Mar 05, 2013 5:13:30 AM org.dbpedia.extraction.mappings.Redirects$ load
INFO: Will extract redirects from source for de wiki, could not load cache file '/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump/dewiki/20130219/dewiki-20130219-template-redirects.obj': java.io.FileNotFoundException: /mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump/dewiki/20130219/dewiki-20130219-template-redirects.obj (No such file or directory)
Mar 05, 2013 5:13:30 AM org.dbpedia.extraction.mappings.Redirects$ loadFromSource
INFO: Loading redirects from source (de)
Mar 05, 2013 5:28:58 AM org.dbpedia.extraction.mappings.Redirects$RedirectFinder apply
WARNING: wrong redirect. page: [title=Mikrogramm;ns=0/Main/;language:wiki=de,locale=de].
found by dbpedia:   [title=Gramm;ns=0/Main/;language:wiki=de,locale=de].
found by wikipedia: [null]
"

Is it because I downloaded the pages-articles file manually instead of using 
dbpedia-extraction, so that other required files could not be downloaded?

Also, where will it write the extracted triples? In the same source directory?
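
For a manually downloaded dump, a small pre-flight check can rule out layout problems before a long run. The sketch below relies only on what the comments in the quoted config state: the source file must live in <dir>/xxwiki/<date>/ with the prefix xxwiki-<date>-, and with require-download-complete=true the directory must contain a 'download-complete' file. check_dump_dir is a hypothetical helper, and treating 'download-complete' as a plain marker file is an assumption.

```python
from pathlib import Path


def check_dump_dir(base_dir, wiki, date,
                   source="pages-articles.xml.bz2",
                   require_download_complete=True):
    """Return a list of problems found in <base_dir>/<wiki>/<date>/.

    An empty list means the layout matches what the config comments describe.
    """
    dated = Path(base_dir) / wiki / date
    if not dated.is_dir():
        return [f"missing directory: {dated}"]

    problems = []
    # The source file name must carry the '<wiki>-<date>-' prefix.
    expected = dated / f"{wiki}-{date}-{source}"
    if not expected.is_file():
        problems.append(f"missing source file: {expected.name}")
    # Assumption: 'download-complete' is a marker file in the dated directory.
    if require_download_complete and not (dated / "download-complete").is_file():
        problems.append("missing 'download-complete' marker file")
    return problems


if __name__ == "__main__":
    base = "/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump"
    for problem in check_dump_dir(base, "dewiki", "20130219"):
        print(problem)
```

If this reports nothing, the layout is probably not the cause of the wait; the log above shows the extractor rebuilding the redirects from source because the cache file does not exist yet.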



On Tue, Mar 5, 2013 at 10:56 AM, riko adi prasetya <[email protected]> 
wrote:

Hi Gaurav,
>
>
>Please check your extraction.de.property again:
>
>
>"# download and extraction target dir
>dir=/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump
>
># Source file. If source file name ends with .gz or .bz2, it is unzipped on the fly.
># Must exist in the directory xxwiki/20121231 and have the prefix xxwiki-20121231-.
> 
># default:
># source=pages-articles.xml
>
># alternatives:
>source=pages-articles.xml.bz2
># source=pages-articles.xml.gz
>
># use only directories that contain a 'download-complete' file? Default is false.
>require-download-complete=true
>
># unqualified extractor class names are prefixed by org.dbpedia.extraction.mappings.
>
># All 111 languages that as of 2012-05-25 have 10000 articles or more.
># TODO: parse wikipedias.csv and figure out from there which languages to extract.
># If no languages are given, the ones having a mapping namespace on mappings.dbpedia.org are used
>languages=de
>
>extractors=InfoboxExtractor
>#ArticleCategoriesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,\
>#GeoExtractor,InfoboxExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,\
>#RedirectExtractor,RevisionIdExtractor,SkosCategoriesExtractor,WikiPageExtractor
>
>extractors.de=InfoboxExtractor
>#extractors.de=MappingExtractor,DisambiguationExtractor,InterLanguageLinksExtractor,RedirectExtractor,LabelExtractor
>#extractors.en=MappingExtractor,DisambiguationExtractor,InterLanguageLinksExtractor,RedirectExtractor,LabelExtractor
>
># if ontology and mapping files are not given or do not exist, download info from mappings.dbpedia.org
>ontology=../ontology.xml
>mappings=../mappings
>
># URI policies. Allowed flags: uri, generic, xml-safe. Each flag may have one of the suffixes
># -subjects, -predicates, -objects, -datatype, -context to match only URIs in a certain position.
># Without a suffix, a flag matches all URI positions.
>
>uri-policy.uri=uri:en; generic:en; xml-safe-predicates:*
>uri-policy.iri=generic:en; xml-safe-predicates:*
>
>
># File formats. Allowed flags: n-triples, n-quads, turtle-triples, turtle-quads, trix-triples, trix-quads
># May be followed by a semicolon and a URI policy name. If format name ends with .gz or .bz2, files
># are zipped on the fly.
>
># NT is unreadable anyway - might as well use URIs
>format.nt=n-triples;uri-policy.uri
>#format.nq.gz=n-quads;uri-policy.uri
>
># Turtle is much more readable - use nice IRIs
>format.ttl=turtle-triples;uri-policy.iri
>#format.tql.gz=turtle-quads;uri-policy.iri
>"
>
>
>
>You wrote dir, but the extractor looks for base-dir, so base-dir is not defined in your extraction configuration.
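
The mismatch can be checked before launching Maven. The following is a minimal sketch, not the framework's own ConfigParser: load_properties is a simplified, hypothetical key=value reader (it ignores ':' separators and backslash line continuations that real Java properties files allow), and the only required key it checks is base-dir, the one named in the error message quoted in this thread.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=("base-dir",)):
    """Return the required keys that the config does not define."""
    return [key for key in required if key not in props]


# The first lines of the config from this thread, using the key 'dir'.
broken = (
    "# download and extraction target dir\n"
    "dir=/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump\n"
    "languages=de\n"
)
# Same config with the key renamed to 'base-dir'.
fixed = broken.replace("dir=", "base-dir=", 1)

print(missing_keys(load_properties(broken)))  # ['base-dir']
print(missing_keys(load_properties(fixed)))   # []
```

Renaming the key while keeping the same path is the whole fix; the rest of the file can stay as it is.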
> 
>Cheers,
>Riko
> 
>
>________________________________
>Riko Adi Prasetya
>Faculty of Computer Science
>Universitas Indonesia
>
>
>
>________________________________
> From: gaurav pant <[email protected]>
>To: [email protected] 
>Sent: Tuesday, 5 March 2013 12:10
>Subject: [Dbpedia-discussion] extraction problem
> 
>
>Hi All,
>
>Greetings for the day.
>
>I want to extract infobox properties and abstracts from 
>(pages-articles.xml.bz2). I am able to download this file using the command 
>"../run download config=download.de.properties"
>
>Here I have configured the file download.de.properties to download only 
>the German pages-articles file.
>
>Now when I try to extract information from it using "../run 
>extraction extraction.de.property", it gives me the error below. In 
>extraction.de.property I have set dir properly, to the same value that I 
>set in the download.de.properties file.
>
>Please let me know what is going wrong. Does any change need to be made 
>in the pom.xml of the dump dir?
>
>"
>[INFO] --- maven-scala-plugin:2.15.2:testCompile (test-compile) @ dump ---
>[INFO] Checking for multiple versions of scala
>[INFO] includes = [**/*.scala,**/*.java,]
>[INFO] excludes = []
>[WARNING] No source files found.
>[INFO] 
>[INFO] <<< maven-scala-plugin:2.15.2:run (default-cli) @ dump <<<
>[INFO] 
>[INFO] --- maven-scala-plugin:2.15.2:run (default-cli) @ dump ---
>[INFO] Checking for multiple versions of scala
>[INFO] launcher 'extraction' selected => org.dbpedia.extraction.dump.extract.Extraction
>java.lang.reflect.InvocationTargetException
>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    at java.lang.reflect.Method.invoke(Method.java:601)
>    at org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
>    at org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
>Caused by: java.lang.IllegalArgumentException: property 'base-dir' not defined.
>    at org.dbpedia.extraction.dump.extract.ConfigParser.error(ConfigParser.scala:18)
>    at org.dbpedia.extraction.dump.extract.Config.<init>(Config.scala:26)
>    at org.dbpedia.extraction.dump.extract.Extraction$.main(Extraction.scala:26)
>    at org.dbpedia.extraction.dump.extract.Extraction.main(Extraction.scala)
>    ... 6 more
>[INFO] ------------------------------------------------------------------------
>[INFO] BUILD FAILURE
>[INFO] ------------------------------------------------------------------------
>[INFO] Total time: 3.356s
>[INFO] Finished at: Tue Mar 05 04:52:35 UTC 2013
>[INFO] Final Memory: 8M/140M
>[INFO] ------------------------------------------------------------------------
>[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:run (default-cli) on project dump: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 240(Exit value: 240) -> [Help 1]
>[ERROR] 
>[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
>[ERROR] Re-run Maven using the -X switch to enable full debug logging.
>[ERROR] 
>[ERROR] For more information about the errors and possible solutions, please read the following articles:
>[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>"
>
>contents of extraction.de.property
>
>"# download and extraction target dir
>dir=/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump
>
># Source file. If source file name ends with .gz or .bz2, it is unzipped on the fly.
># Must exist in the directory xxwiki/20121231 and have the prefix xxwiki-20121231-.
> 
># default:
># source=pages-articles.xml
>
># alternatives:
>source=pages-articles.xml.bz2
># source=pages-articles.xml.gz
>
># use only directories that contain a 'download-complete' file? Default is false.
>require-download-complete=true
>
># unqualified extractor class names are prefixed by org.dbpedia.extraction.mappings.
>
># All 111 languages that as of 2012-05-25 have 10000 articles or more.
># TODO: parse wikipedias.csv and figure out from there which languages to extract.
># If no languages are given, the ones having a mapping namespace on mappings.dbpedia.org are used
>languages=de
>
>extractors=InfoboxExtractor
>#ArticleCategoriesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,\
>#GeoExtractor,InfoboxExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,\
>#RedirectExtractor,RevisionIdExtractor,SkosCategoriesExtractor,WikiPageExtractor
>
>extractors.de=InfoboxExtractor
>#extractors.de=MappingExtractor,DisambiguationExtractor,InterLanguageLinksExtractor,RedirectExtractor,LabelExtractor
>#extractors.en=MappingExtractor,DisambiguationExtractor,InterLanguageLinksExtractor,RedirectExtractor,LabelExtractor
>
># if ontology and mapping files are not given or do not exist, download info from mappings.dbpedia.org
>ontology=../ontology.xml
>mappings=../mappings
>
># URI policies. Allowed flags: uri, generic, xml-safe. Each flag may have one of the suffixes
># -subjects, -predicates, -objects, -datatype, -context to match only URIs in a certain position.
># Without a suffix, a flag matches all URI positions.
>
>uri-policy.uri=uri:en; generic:en; xml-safe-predicates:*
>uri-policy.iri=generic:en; xml-safe-predicates:*
>
>
># File formats. Allowed flags: n-triples, n-quads, turtle-triples, turtle-quads, trix-triples, trix-quads
># May be followed by a semicolon and a URI policy name. If format name ends with .gz or .bz2, files
># are zipped on the fly.
>
># NT is unreadable anyway - might as well use URIs
>format.nt=n-triples;uri-policy.uri
>#format.nq.gz=n-quads;uri-policy.uri
>
># Turtle is much more readable - use nice IRIs
>format.ttl=turtle-triples;uri-policy.iri
>#format.tql.gz=turtle-quads;uri-policy.iri
>"
>
>-- 
>Regards
>Gaurav Pant
>+91-7709196607,+91-9405757794
>
>
>_______________________________________________
>Dbpedia-discussion mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>
>


-- 
Regards
Gaurav Pant
+91-7709196607,+91-9405757794

