Hello Max,

Thank you for your prompt reply! After switching to the dump branch, changing 
the namespace, and running mvn clean install on the project root, I get the 
following output:

mvn scala:run "-Dlauncher=extraction" "-DaddArgs=extraction.properties"
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building DBpedia Dump Extraction
[INFO]    task-segment: [scala:run]
[INFO] ------------------------------------------------------------------------
[INFO] Preparing scala:run
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mmlab/DBpedia_Extraction_Framework/extraction_framework/dump/src/main/resources
[INFO] [scala:compile {execution: process-resources}]
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.scala,**/*.java,]
[INFO] excludes = []
[INFO] Nothing to compile - all classes are up to date
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [scala:compile {execution: compile}]
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.scala,**/*.java,]
[INFO] excludes = []
[INFO] Nothing to compile - all classes are up to date
[INFO] [resources:testResources {execution: default-testResources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mmlab/DBpedia_Extraction_Framework/extraction_framework/dump/src/test/resources
[INFO] [compiler:testCompile {execution: default-testCompile}]
[INFO] No sources to compile
[INFO] [scala:testCompile {execution: test-compile}]
[INFO] Checking for multiple versions of scala
[INFO] includes = [**/*.scala,**/*.java,]
[INFO] excludes = []
[WARNING] No source files found.
[INFO] [scala:run {execution: default-cli}]
[INFO] Checking for multiple versions of scala
[INFO] launcher 'extraction' selected => org.dbpedia.extraction.dump.extract.Extraction
Sep 20, 2012 9:43:40 AM org.dbpedia.extraction.mappings.Redirects$ loadFromCache
INFO: Loading redirects from cache file /home/mmlab/wikipedia/nlwiki/20120824/nlwiki-20120824-template-redirects.obj
Sep 20, 2012 9:43:40 AM org.dbpedia.extraction.mappings.Redirects$ load
INFO: Will extract redirects from source for nl wiki, could not load cache file '/home/mmlab/wikipedia/nlwiki/20120824/nlwiki-20120824-template-redirects.obj': java.io.FileNotFoundException: /home/mmlab/wikipedia/nlwiki/20120824/nlwiki-20120824-template-redirects.obj (No such file or directory)
Sep 20, 2012 9:43:40 AM org.dbpedia.extraction.mappings.Redirects$ loadFromSource
INFO: Loading redirects from source (nl)
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org_scala_tools_maven_executions.MainHelper.runMain(MainHelper.java:161)
        at org_scala_tools_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.IllegalArgumentException: Illegal pattern character 'X'
        at java.text.SimpleDateFormat.compile(SimpleDateFormat.java:769)
        at java.text.SimpleDateFormat.initialize(SimpleDateFormat.java:576)
        at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:501)
        at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:476)
        at org.dbpedia.extraction.util.StringUtils$$anon$1.initialValue(StringUtils.scala:16)
        at org.dbpedia.extraction.util.StringUtils$$anon$1.initialValue(StringUtils.scala:13)
        at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:160)
        at java.lang.ThreadLocal.get(ThreadLocal.java:150)
        at org.dbpedia.extraction.util.StringUtils$.parseTimestamp(StringUtils.scala:31)
        at org.dbpedia.extraction.sources.WikiPage.<init>(WikiPage.scala:24)
        at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:351)
        at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:245)
        at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:185)
        at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:143)
        at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:114)
        at org.dbpedia.extraction.sources.XMLReaderSource.foreach(XMLSource.scala:64)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:239)
        at org.dbpedia.extraction.sources.XMLReaderSource.flatMap(XMLSource.scala:60)
        at org.dbpedia.extraction.mappings.Redirects$.loadFromSource(Redirects.scala:165)
        at org.dbpedia.extraction.mappings.Redirects$.load(Redirects.scala:116)
        at org.dbpedia.extraction.dump.extract.ConfigLoader$$anon$1.<init>(ConfigLoader.scala:96)
        at org.dbpedia.extraction.dump.extract.ConfigLoader.org$dbpedia$extraction$dump$extract$ConfigLoader$$createExtractionJob(ConfigLoader.scala:51)
        at org.dbpedia.extraction.dump.extract.ConfigLoader$$anonfun$getExtractionJobs$1.apply(ConfigLoader.scala:36)
        at org.dbpedia.extraction.dump.extract.ConfigLoader$$anonfun$getExtractionJobs$1.apply(ConfigLoader.scala:36)
        at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
        at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
        at org.dbpedia.extraction.dump.extract.Extraction$.main(Extraction.scala:29)
        at org.dbpedia.extraction.dump.extract.Extraction.main(Extraction.scala)
        ... 6 more

Hopefully you can give me some more details about what is going wrong.
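
One thing that can be checked outside the framework is the pattern itself. The 
sketch below assumes the failing pattern contains the ISO 8601 time zone letter 
'X', which java.text.SimpleDateFormat accepts only from Java 7 onward; the line 
numbers in the trace look like an older runtime, but that is an inference from 
the log, not something the log states. The class name, both pattern strings, 
and the helper methods are hypothetical and are not taken from StringUtils.scala.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampPatternCheck {

    // Hypothetical pattern using 'X'; the real pattern in StringUtils.scala is not shown in the log.
    static final String ISO_PATTERN = "yyyy-MM-dd'T'HH:mm:ssX";

    // Fallback that avoids 'X': the trailing 'Z' is matched as a literal and the zone is fixed to UTC.
    static final String LEGACY_PATTERN = "yyyy-MM-dd'T'HH:mm:ss'Z'";

    // Returns true if this runtime's SimpleDateFormat can compile the pattern.
    static boolean supportsPattern(String pattern) {
        try {
            new SimpleDateFormat(pattern);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Parses a UTC timestamp such as 2012-08-24T12:34:56Z into epoch milliseconds.
    static long parseUtcTimestamp(String timestamp) {
        SimpleDateFormat format = new SimpleDateFormat(LEGACY_PATTERN);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        try {
            return format.parse(timestamp).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a UTC timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("'X' supported: " + supportsPattern(ISO_PATTERN));
        System.out.println(parseUtcTimestamp("2012-08-24T12:34:56Z"));
    }
}
```

If supportsPattern reports false on the machine that runs the extraction, 
running Maven with a Java 7 or later JDK would be the first thing to try.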

Regarding the templates and titles used for disambiguation pages: 
[8] lists the following templates: Dp, dpintro, DP, Disambig. 
However, when I ran some tests on the nlWiki article dump, the Disambig 
template did not appear to be used. Dutch disambiguation pages often have no 
marker in parentheses in the title indicating that the page is a disambiguation 
page (e.g., "title (doorverwijzing)"). I would therefore expect that searching 
a wiki page for disambiguation templates is sufficient. Can anyone give me 
some more details about the TODO mentioned in [5]?
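
The template-only check described above could be sketched as follows. This is 
not the framework's implementation: the class and method names are 
hypothetical, the template list is the one quoted from [8], and the sketch 
assumes that MediaWiki treats the first letter of a template name as 
case-insensitive and an underscore as equivalent to a space.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DisambiguationTemplateCheck {

    // Template names as listed in [8]; stored in normalized form.
    static final Set<String> TEMPLATES = normalizeAll("Dp", "dpintro", "DP", "Disambig");

    // Matches the name of a template call: "{{Name}}" or "{{Name|...".
    static final Pattern TEMPLATE_CALL =
            Pattern.compile("\\{\\{\\s*([^|{}]+?)\\s*(?:\\||\\}\\})");

    // Upper-cases the first letter and maps underscores to spaces (assumed MediaWiki behaviour).
    static String normalize(String name) {
        String trimmed = name.trim().replace('_', ' ');
        if (trimmed.isEmpty()) {
            return trimmed;
        }
        return trimmed.substring(0, 1).toUpperCase(Locale.ROOT) + trimmed.substring(1);
    }

    static Set<String> normalizeAll(String... names) {
        Set<String> result = new HashSet<String>();
        for (String name : names) {
            result.add(normalize(name));
        }
        return result;
    }

    // Returns true if the wiki text calls at least one of the listed templates.
    static boolean isDisambiguation(String wikiText) {
        Matcher matcher = TEMPLATE_CALL.matcher(wikiText);
        while (matcher.find()) {
            if (TEMPLATES.contains(normalize(matcher.group(1)))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isDisambiguation("{{dp}}\n* [[Foo (stad)]]\n* [[Foo (film)]]"));
        System.out.println(isDisambiguation("{{Infobox persoon|naam=Foo}}"));
    }
}
```

A check like this ignores the page title entirely, so it would not depend on a 
suffix such as "(doorverwijzing)" being present.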


[8] http://nl.wikipedia.org/wiki/MediaWiki:Disambiguationspage


-----Original Message-----
From: Max Jakob [mailto:[email protected]] 
Sent: Wednesday, September 19, 2012 4:45 PM
To: Pedro Debevere
Cc: [email protected]
Subject: Re: [Dbpedia-discussion] DBpedia Extraction Framework Dutch 
disambiguation data set

Hi,

On Wed, Sep 19, 2012 at 3:46 PM, Pedro Debevere <[email protected]> wrote:
> I’m interested in creating a Dutch port of DBpedia Spotlight. In order 
> to do this, I need a disambiguation data set for Dutch. This data set 
> is currently not available for download. However, based on some 
> messages posted here [1], I suspect that the latest version of the extraction 
> framework supports this.
> Is this correct?

Generally yes, if all names of disambiguation templates are specified in [4]. 
Please also note that there seems to be an issue with multiple names for 
disambiguation page titles in Dutch. See the TODO in [5].


> As a workaround I downloaded unpacked the nl-pages-articles.xml file 
> myself

On your first attempt, it looks like something went wrong during the download. 
So downloading and unpacking the file yourself was a good idea.


> Message: expected <mediawiki> with namespace 
> [http://www.mediawiki.org/xml/export-0.6/], found 
> [http://www.mediawiki.org/xml/export-0.7/]

Wikipedia seems to have changed its export format version from 0.6 to 0.7. The 
DBpedia parser should still be able to parse the dump, assuming the changes 
mentioned in [6]. You can try to switch to the dump branch (currently the 
stable one) and change the line in [7] to

  private final String _namespace = "http://www.mediawiki.org/xml/export-0.7/";

and try again. (Run  mvn clean install  on the project root first.)
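
Before rebuilding, it can help to confirm which export namespace a dump 
actually declares. The following is a minimal sketch using the standard StAX 
API (javax.xml.stream); the class and method names are hypothetical and 
independent of WikipediaDumpParser.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DumpNamespaceCheck {

    // Returns the namespace URI of the first element in the stream, or null if there is no element.
    static String rootNamespace(Reader xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getNamespaceURI();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML root element", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the first line of a dump file.
        String sample = "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.7/\" "
                + "version=\"0.7\"></mediawiki>";
        System.out.println(rootNamespace(new StringReader(sample)));
    }
}
```

For a real dump, pass a reader over the unpacked pages-articles XML file; only 
the first element is read, so the size of the file does not matter.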


Cheers,
Max

[4] 
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Disambiguation.scala#l165
[5] 
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/config/mappings/DisambiguationExtractorConfig.scala#l16
[6] http://www.mediawiki.org/xml/export-0.7.xsd
[7] 
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParser.java#l74


