Hi All,
I want to extract abstracts from the pages-articles dump using the DBpedia
extraction framework.
However, some pages have no proper abstract. Some only say that they
redirect to another page, and others contain information that is not
needed.
Is there any way to get cleaner abstracts?
After analyzing the dump, I found the following:
if the <text>...</text> element begins with #REDIRECT or #redirect, then the
page is a redirect page.
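
The redirect check can be sketched roughly as follows. This is a standalone Python illustration, not code from the extraction framework: the function names, the made-up sample XML, and the optional list of localized keywords (for example WEITERLEITUNG on the German wiki) are all assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET
from io import BytesIO


def redirect_pattern(extra_keywords=()):
    """Build a regex for '#REDIRECT [[Target]]' at the start of the wikitext.

    extra_keywords is a hypothetical parameter for localized aliases.
    """
    keywords = ("REDIRECT",) + tuple(extra_keywords)
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"^\s*#(?:%s)\s*:?\s*\[\[([^\]|#]+)" % alternatives,
                      re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (title, text) pairs from a pages-articles style XML source."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # free the finished page's subtree


def non_redirect_pages(source, pattern=None):
    """Yield only the pages whose text does not start with a redirect."""
    pattern = pattern or redirect_pattern()
    for title, text in iter_pages(source):
        if not pattern.match(text):
            yield title, text


# Made-up sample data, only to exercise the functions above.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Page A</title><revision><text>#REDIRECT [[Page B]]</text></revision></page>
  <page><title>Page B</title><revision><text>Some real article text.</text></revision></page>
  <page><title>Page C</title><revision><text>#weiterleitung [[Page B]]</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    pattern = redirect_pattern(extra_keywords=["WEITERLEITUNG"])
    for title, _text in non_redirect_pages(BytesIO(SAMPLE), pattern):
        print(title)
```

For a real dump, the source could be a file object such as bz2.open(path, "rb") instead of the in-memory sample. The sketch only detects redirects; it does not clean up other unwanted content in an abstract.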
If anyone has other ideas, please let me know.
Thanks
On Tue, Mar 5, 2013 at 11:36 AM, riko adi prasetya
<[email protected]> wrote:
> Hi Gaurav,
>
> Be patient.
> It took me 4 hours to extract the Indonesian data dump.
> I think it depends on the host specs and the size of the data dump.
>
> Yes, the extracted triples are written to the same source directory.
>
> Cheers,
> Riko
>
> ------------------------------
> *From:* gaurav pant <[email protected]>
> *To:* riko adi prasetya <[email protected]>
> *Sent:* Tuesday, 5 March 2013 12:38
> *Subject:* Re: [Dbpedia-discussion] extraction problem
>
> Hi Riko,
>
> Thanks for your reply. I have tried that change. It is running, but it has
> been waiting for a long time at
>
> "
> Mar 05, 2013 5:13:30 AM org.dbpedia.extraction.mappings.Redirects$ loadFromCache
> INFO: Loading redirects from cache file /mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump/dewiki/20130219/dewiki-20130219-template-redirects.obj
> Mar 05, 2013 5:13:30 AM org.dbpedia.extraction.mappings.Redirects$ load
> INFO: Will extract redirects from source for de wiki, could not load cache file '/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump/dewiki/20130219/dewiki-20130219-template-redirects.obj': java.io.FileNotFoundException: /mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump/dewiki/20130219/dewiki-20130219-template-redirects.obj (No such file or directory)
> Mar 05, 2013 5:13:30 AM org.dbpedia.extraction.mappings.Redirects$ loadFromSource
> INFO: Loading redirects from source (de)
> Mar 05, 2013 5:28:58 AM org.dbpedia.extraction.mappings.Redirects$RedirectFinder apply
> WARNING: wrong redirect. page:
> [title=Mikrogramm;ns=0/Main/;language:wiki=de,locale=de].
> found by dbpedia: [title=Gramm;ns=0/Main/;language:wiki=de,locale=de].
> found by wikipedia: [null]
> "
>
> Is it because I downloaded the pages-articles file manually rather than
> through dbpedia-extraction, so that other required files were not
> downloaded?
>
> Also, where will the extracted triples be written? In the same source
> directory?
>
>
> On Tue, Mar 5, 2013 at 10:56 AM, riko adi prasetya <[email protected]> wrote:
>
> Hi Gaurav,
>
> Check your *extraction.de.property* again:
>
> "# download and extraction target dir
> dir=/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump
>
> # Source file. If source file name ends with .gz or .bz2, it is unzipped on the fly.
> # Must exist in the directory xxwiki/20121231 and have the prefix xxwiki-20121231-.
>
> # default:
> # source=pages-articles.xml
>
> # alternatives:
> source=pages-articles.xml.bz2
> # source=pages-articles.xml.gz
>
> # use only directories that contain a 'download-complete' file? Default is false.
> require-download-complete=true
>
> # unqualified extractor class names are prefixed by org.dbpedia.extraction.mappings.
>
> # All 111 languages that as of 2012-05-25 have 10000 articles or more.
> # TODO: parse wikipedias.csv and figure out from there which languages to extract.
> # If no languages are given, the ones having a mapping namespace on mappings.dbpedia.org are used
> languages=de
>
> extractors=InfoboxExtractor
> #ArticleCategoriesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,\
> #GeoExtractor,InfoboxExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,\
> #RedirectExtractor,RevisionIdExtractor,SkosCategoriesExtractor,WikiPageExtractor
>
> extractors.de=InfoboxExtractor
> #extractors.de=MappingExtractor,DisambiguationExtractor,InterLanguageLinksExtractor,RedirectExtractor,LabelExtractor
> #extractors.en=MappingExtractor,DisambiguationExtractor,InterLanguageLinksExtractor,RedirectExtractor,LabelExtractor
>
> # if ontology and mapping files are not given or do not exist, download info from mappings.dbpedia.org
> ontology=../ontology.xml
> mappings=../mappings
>
> # URI policies. Allowed flags: uri, generic, xml-safe. Each flag may have one of the suffixes
> # -subjects, -predicates, -objects, -datatype, -context to match only URIs in a certain position.
> # Without a suffix, a flag matches all URI positions.
>
> uri-policy.uri=uri:en; generic:en; xml-safe-predicates:*
> uri-policy.iri=generic:en; xml-safe-predicates:*
>
> # File formats. Allowed flags: n-triples, n-quads, turtle-triples, turtle-quads, trix-triples, trix-quads
> # May be followed by a semicolon and a URI policy name. If format name ends with .gz or .bz2, files
> # are zipped on the fly.
>
> # NT is unreadable anyway - might as well use URIs
> format.nt=n-triples;uri-policy.uri
> #format.nq.gz=n-quads;uri-policy.uri
>
> # Turtle is much more readable - use nice IRIs
> format.ttl=turtle-triples;uri-policy.iri
> #format.tql.gz=turtle-quads;uri-policy.iri
> "
>
> You wrote dir instead of base-dir, so no base-dir property is defined in
> your extraction configuration. Rename dir to base-dir.
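
A quick sanity check before starting a long extraction run could flag this kind of missing key. The Python sketch below is not part of the extraction framework. The function names are made up, the parser handles only plain key=value lines (not the full Java properties syntax), and the only required key it knows about is base-dir, taken from the error message in this thread.

```python
def load_properties(lines):
    """Parse simple key=value lines, skipping blanks and '#' comments.

    Simplified on purpose: no escapes, no continuation lines,
    no ':' separators.
    """
    props = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=("base-dir",)):
    """Return the required keys that are absent from props.

    The default list is an assumption based on the
    "property 'base-dir' not defined" error.
    """
    return [key for key in required if key not in props]


# Shortened excerpt of the configuration quoted in this thread.
config = """
# download and extraction target dir
dir=/mnt/ebs/perl/framework/extraction-framework/dump/wiki_dump
source=pages-articles.xml.bz2
languages=de
""".splitlines()

print(missing_keys(load_properties(config)))  # ['base-dir']
```

After renaming dir to base-dir in the file, the same check would return an empty list.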
>
> Cheers,
> Riko
>
> ------------------------------
> Riko Adi Prasetya
> Faculty of Computer Science
> Universitas Indonesia
>
> ------------------------------
> *From:* gaurav pant <[email protected]>
> *To:* [email protected]
> *Sent:* Tuesday, 5 March 2013 12:10
> *Subject:* [Dbpedia-discussion] extraction problem
>
> Hi All,
>
> Greetings for the day.
>
> I want to extract infobox properties and abstracts from
> pages-articles.xml.bz2. I was able to download this file using the command
> "../run download config=download.de.properties"
>
> I have configured download.de.properties to download only the German
> pages-articles file.
>
> Now, when I try to extract information from it using "../run
> extraction extraction.de.property", I get the error below. In
> *extraction.de.property* I have set dir to the same value that I used in
> download.de.properties.
>
> Please let me know what is going wrong. Does anything need to be changed
> in the pom.xml of the dump dir?
>
> "
> [INFO] --- maven-scala-plugin:2.15.2:testCompile (test-compile) @ dump ---
> [INFO] Checking for multiple versions of scala
> [INFO] includes = [**/*.scala,**/*.java,]
> [INFO] excludes = []
> [WARNING] No source files found.
> [INFO]
> [INFO] <<< maven-scala-plugin:2.15.2:run (default-cli) @ dump <<<
> [INFO]
> [INFO] --- maven-scala-plugin:2.15.2:run (default-cli) @ dump ---
> [INFO] Checking for multiple versions of scala
> [INFO] launcher 'extraction' selected => org.dbpedia.extraction.dump.extract.Extraction
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
> at org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
> *Caused by: java.lang.IllegalArgumentException: property 'base-dir' not defined.*
> at org.dbpedia.extraction.dump.extract.ConfigParser.error(ConfigParser.scala:18)
> at org.dbpedia.extraction.dump.extract.Config.<init>(Config.scala:26)
> at org.dbpedia.extraction.dump.extract.Extraction$.main(Extraction.scala:26)
> at org.dbpedia.extraction.dump.extract.Extraction.main(Extraction.scala)
> ... 6 more
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 3.356s
> [INFO] Finished at: Tue Mar 05 04:52:35 UTC 2013
> [INFO] Final Memory: 8M/140M
> [INFO] ------------------------------------------------------------------------
> [ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:run (default-cli) on project dump: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 240(Exit value: 240) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> "
>
> --
> Regards
> Gaurav Pant
> +91-7709196607,+91-9405757794
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_feb
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
--
Regards
Gaurav Pant
+91-7709196607,+91-9405757794