Hello Max,
Thank you for your prompt reply! After switching to the dump branch, changing
the namespace, and running mvn clean install on the project root, I get the
following output:
mvn scala:run "-Dlauncher=extraction" "-DaddArgs=extraction.properties"
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building DBpedia Dump Extraction
[INFO] task-segment: [scala:run]
[INFO] ------------------------------------------------------------------------
[INFO] Preparing scala:run
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mmlab/DBpedia_Extraction_Framework/extraction_framework/dump/src/main/resources
[INFO] [scala:compile {execution: process-resources}]
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.scala,**/*.java,]
[INFO] excludes = []
[INFO] Nothing to compile - all classes are up to date
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [scala:compile {execution: compile}]
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.scala,**/*.java,]
[INFO] excludes = []
[INFO] Nothing to compile - all classes are up to date
[INFO] [resources:testResources {execution: default-testResources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mmlab/DBpedia_Extraction_Framework/extraction_framework/dump/src/test/resources
[INFO] [compiler:testCompile {execution: default-testCompile}]
[INFO] No sources to compile
[INFO] [scala:testCompile {execution: test-compile}]
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.scala,**/*.java,]
[INFO] excludes = []
[WARNING] No source files found.
[INFO] [scala:run {execution: default-cli}]
[INFO] Checking for multiple versions of scala
[INFO] launcher 'extraction' selected => org.dbpedia.extraction.dump.extract.Extraction
Sep 20, 2012 9:43:40 AM org.dbpedia.extraction.mappings.Redirects$ loadFromCache
INFO: Loading redirects from cache file /home/mmlab/wikipedia/nlwiki/20120824/nlwiki-20120824-template-redirects.obj
Sep 20, 2012 9:43:40 AM org.dbpedia.extraction.mappings.Redirects$ load
INFO: Will extract redirects from source for nl wiki, could not load cache file '/home/mmlab/wikipedia/nlwiki/20120824/nlwiki-20120824-template-redirects.obj': java.io.FileNotFoundException: /home/mmlab/wikipedia/nlwiki/20120824/nlwiki-20120824-template-redirects.obj (No such file or directory)
Sep 20, 2012 9:43:40 AM org.dbpedia.extraction.mappings.Redirects$ loadFromSource
INFO: Loading redirects from source (nl)
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
    at org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.IllegalArgumentException: Illegal pattern character 'X'
    at java.text.SimpleDateFormat.compile(SimpleDateFormat.java:769)
    at java.text.SimpleDateFormat.initialize(SimpleDateFormat.java:576)
    at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:501)
    at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:476)
    at org.dbpedia.extraction.util.StringUtils$$anon$1.initialValue(StringUtils.scala:16)
    at org.dbpedia.extraction.util.StringUtils$$anon$1.initialValue(StringUtils.scala:13)
    at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:160)
    at java.lang.ThreadLocal.get(ThreadLocal.java:150)
    at org.dbpedia.extraction.util.StringUtils$.parseTimestamp(StringUtils.scala:31)
    at org.dbpedia.extraction.sources.WikiPage.<init>(WikiPage.scala:24)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:351)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:245)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:185)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:143)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:114)
    at org.dbpedia.extraction.sources.XMLReaderSource.foreach(XMLSource.scala:64)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:239)
    at org.dbpedia.extraction.sources.XMLReaderSource.flatMap(XMLSource.scala:60)
    at org.dbpedia.extraction.mappings.Redirects$.loadFromSource(Redirects.scala:165)
    at org.dbpedia.extraction.mappings.Redirects$.load(Redirects.scala:116)
    at org.dbpedia.extraction.dump.extract.ConfigLoader$$anon$1.<init>(ConfigLoader.scala:96)
    at org.dbpedia.extraction.dump.extract.ConfigLoader.org$dbpedia$extraction$dump$extract$ConfigLoader$$createExtractionJob(ConfigLoader.scala:51)
    at org.dbpedia.extraction.dump.extract.ConfigLoader$$anonfun$getExtractionJobs$1.apply(ConfigLoader.scala:36)
    at org.dbpedia.extraction.dump.extract.ConfigLoader$$anonfun$getExtractionJobs$1.apply(ConfigLoader.scala:36)
    at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
    at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
    at org.dbpedia.extraction.dump.extract.Extraction$.main(Extraction.scala:29)
    at org.dbpedia.extraction.dump.extract.Extraction.main(Extraction.scala)
    ... 6 more
Hopefully you can provide me some more details about what's going wrong.
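In case it helps with the diagnosis, here is a minimal sketch that isolates the
failing call. It assumes that the timestamp pattern in StringUtils.scala
contains the 'X' (ISO 8601 time zone) letter, which SimpleDateFormat only
accepts from Java 7 onward. The pattern string and the class name PatternCheck
are my own placeholders, not taken from the framework.

```java
import java.text.SimpleDateFormat;

public class PatternCheck {
    // Returns true if this JVM's SimpleDateFormat can compile the pattern.
    // An unsupported pattern letter raises IllegalArgumentException, as in the trace above.
    static boolean isPatternSupported(String pattern) {
        try {
            new SimpleDateFormat(pattern);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed pattern; the real one is defined in StringUtils.scala.
        String assumedPattern = "yyyy-MM-dd'T'HH:mm:ssX";
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("'X' supported: " + isPatternSupported(assumedPattern));
    }
}
```

If this reports that 'X' is not supported, the JVM that Maven runs on is
probably older than Java 7.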
Regarding the templates and titles used for disambiguation pages:
[8] lists the following templates: Dp, dpintro, DP, Disambig.
However, in some tests on the nlWiki article dump, the Disambig template did
not appear to be used. Titles of Dutch disambiguation pages often carry no
parenthesized marker (e.g., "title (doorverwijzing)") indicating that the page
is a disambiguation page. I would therefore expect that searching a wiki page
for disambiguation templates is sufficient. Can anyone provide some more
details about the TODO mentioned in [5]?
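To make concrete what I mean by searching for disambiguation templates, here
is a rough Java sketch. The class and method names are hypothetical and not
part of the extraction framework. The template list is the one from [8], and I
assume that only the first letter of a template name is case-insensitive, as
is usual in MediaWiki.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DisambiguationCheck {
    // Template names from [8], stored with an upper-case first letter.
    private static final Set<String> TEMPLATES =
        new HashSet<String>(Arrays.asList("Dp", "Dpintro", "DP", "Disambig"));

    // Captures the name part of a template call such as {{dp}} or {{Dpintro|...}}.
    private static final Pattern TEMPLATE_CALL =
        Pattern.compile("\\{\\{\\s*([^|{}]+?)\\s*(?:\\||\\}\\})");

    // Normalizes a template name: underscores become spaces, first letter upper-cased.
    static String normalize(String name) {
        String n = name.trim().replace('_', ' ');
        return n.isEmpty() ? n : Character.toUpperCase(n.charAt(0)) + n.substring(1);
    }

    // Returns true if the wiki text calls at least one of the listed templates.
    static boolean isDisambiguation(String wikiText) {
        Matcher m = TEMPLATE_CALL.matcher(wikiText);
        while (m.find()) {
            if (TEMPLATES.contains(normalize(m.group(1)))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isDisambiguation("'''Mercurius''' kan verwijzen naar:\n* ...\n{{dp}}"));
        System.out.println(isDisambiguation("{{Infobox persoon|naam=X}} tekst"));
    }
}
```

This ignores the page title entirely, which is the simplification I am asking
about. It also does not handle templates that reach the list only via a
redirect.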
[8] http://nl.wikipedia.org/wiki/MediaWiki:Disambiguationspage
-----Original Message-----
From: Max Jakob [mailto:[email protected]]
Sent: Wednesday, September 19, 2012 4:45 PM
To: Pedro Debevere
Cc: [email protected]
Subject: Re: [Dbpedia-discussion] DBpedia Extraction Framework Dutch
disambiguation data set
Hi,
On Wed, Sep 19, 2012 at 3:46 PM, Pedro Debevere <[email protected]> wrote:
> I’m interested in creating a Dutch port of DBpedia Spotlight. In order
> to do this, I need a disambiguation data set for Dutch. This data set
> is currently not available for download. However, based on some
> messages posted here [1], I suspect that the latest version of the extraction
> framework supports this.
> Is this correct?
Generally yes, if all names of disambiguation templates are specified in [4].
Please also note that there seems to be an issue with multiple names for
disambiguation page titles in Dutch. See the TODO in [5].
> As a workaround I downloaded and unpacked the nl-pages-articles.xml file
> myself
On your first attempt, it looks like something went wrong during the download.
So downloading and unpacking the file yourself was a good idea.
> Message: expected <mediawiki> with namespace
> [http://www.mediawiki.org/xml/export-0.6/], found
> [http://www.mediawiki.org/xml/export-0.7/]
Wikipedia seems to have changed its export format version from 0.6 to 0.7. The
DBpedia parser should still be able to parse the dump, given the changes
listed in [6]. You can try switching to the dump branch (currently the
stable one), changing the line in [7] to
private final String _namespace = "http://www.mediawiki.org/xml/export-0.7/";
and trying again. (Run mvn clean install on the project root first.)
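As an illustration only (this is not the framework's code), a parser could
also accept either export version instead of hard-coding one. A minimal Java
sketch using the standard StAX API follows. NamespaceCheck and its methods are
hypothetical names, and the accepted set is just the two versions from the
error message.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class NamespaceCheck {
    // The two export namespaces that appear in the error message.
    private static final Set<String> ACCEPTED = new HashSet<String>(Arrays.asList(
        "http://www.mediawiki.org/xml/export-0.6/",
        "http://www.mediawiki.org/xml/export-0.7/"));

    // Returns the namespace URI of the document's root element.
    static String rootNamespace(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // advance to the root start element
                return reader.getNamespaceURI();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    // Returns true if the root element uses one of the accepted export namespaces.
    static boolean isAccepted(String xml) {
        return ACCEPTED.contains(rootNamespace(xml));
    }

    public static void main(String[] args) {
        String dump = "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.7/\"></mediawiki>";
        System.out.println(rootNamespace(dump) + " accepted: " + isAccepted(dump));
    }
}
```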
Cheers,
Max
[4] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Disambiguation.scala#l165
[5] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/config/mappings/DisambiguationExtractorConfig.scala#l16
[6] http://www.mediawiki.org/xml/export-0.7.xsd
[7] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParser.java#l74