Hi,
I've read the internationalization guide (which is very informative) and
have the following additional remarks/questions:
- I think it could be beneficial to combine the instance_type_$lang and
instance_type_en data sets to obtain more instance type information. The
instance_type_nl.nt data set, for example, currently contains only a small
amount of instance type information (because few mappings are defined so far
for the Dutch version of Wikipedia). The two could be combined by, for
example, looking up resources in the canonical DBpedia data set that have an
owl:sameAs link to a localized version.
- The .nt data sets of the canonicalized DBpedia use URIs (URL-encoded),
whereas the .nt files of the localized editions seem to use IRIs (with
Unicode escape sequences). Could this lead to any problems during index
generation?
- The list of Lucene Analyzers on the Internationalization page seems to
include only subclasses of org.apache.lucene.analysis.StopwordAnalyzerBase.
Should this list be extended to include all subclasses of
org.apache.lucene.analysis.Analyzer (e.g., DutchAnalyzer, which is a subclass
of org.apache.lucene.analysis.ReusableAnalyzerBase), or are these the only
usable analyzers in Spotlight?
- Can the amount of memory needed to generate the index be reduced
by setting the minNumDocsBeforeFlush variable in
IndexMergedOccurrences.scala to a smaller value?
- The remark posted here [1] (if it still applies) may also be relevant and
could be mentioned on the Internationalization page.
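
To make the first remark concrete, here is a minimal sketch of the merge I have in mind. It assumes the instance types and the owl:sameAs links have already been loaded into in-memory maps; the class name InstanceTypeMerger and its merge method are hypothetical and not part of DBpedia Spotlight or the extraction framework.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of DBpedia Spotlight or the extraction framework.
final class InstanceTypeMerger {

    /**
     * Returns the localized types plus, for every owl:sameAs pair, the types
     * of the linked canonical resource.
     *
     * @param localizedTypes localized resource -> types from instance_type_$lang
     * @param canonicalTypes canonical resource -> types from instance_type_en
     * @param sameAs         canonical resource -> localized resource
     */
    static Map<String, Set<String>> merge(
            Map<String, Set<String>> localizedTypes,
            Map<String, Set<String>> canonicalTypes,
            Map<String, String> sameAs) {
        Map<String, Set<String>> merged = new TreeMap<>();
        // Start from the types already extracted for the localized edition.
        localizedTypes.forEach((resource, types) ->
                merged.computeIfAbsent(resource, k -> new TreeSet<>()).addAll(types));
        // Add the canonical types of every resource that has a sameAs link.
        sameAs.forEach((canonical, localized) -> {
            Set<String> types = canonicalTypes.get(canonical);
            if (types != null) {
                merged.computeIfAbsent(localized, k -> new TreeSet<>()).addAll(types);
            }
        });
        return merged;
    }
}
```

Canonical resources without a sameAs link are ignored, so the result only ever contains localized resources. For the full dumps the maps would probably have to be replaced by a streaming or on-disk join.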
[1] http://www.mail-archive.com/[email protected]/msg03005.html
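
Regarding the URI/IRI remark: if the two encodings do turn out to be a problem, one possible workaround is to bring both forms to a common key before comparing or indexing them. The sketch below is only an illustration of that idea. ResourceKeyNormalizer is a hypothetical class, the resource names are made up, and I have not checked how Spotlight itself treats these identifiers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of DBpedia Spotlight.
final class ResourceKeyNormalizer {

    /** Replaces N-Triples escapes (backslash-u + 4 hex digits, backslash-U + 8) by the character. */
    static String unescapeNTriples(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int digits = s.startsWith("\\u", i) ? 4 : s.startsWith("\\U", i) ? 8 : 0;
            if (digits > 0 && i + 2 + digits <= s.length()) {
                // Throws NumberFormatException on malformed escapes; acceptable for a sketch.
                sb.appendCodePoint(Integer.parseInt(s.substring(i + 2, i + 2 + digits), 16));
                i += 2 + digits;
            } else {
                sb.append(s.charAt(i++));
            }
        }
        return sb.toString();
    }

    /** Decodes %XX sequences as UTF-8 bytes; everything else is copied unchanged. */
    static String percentDecode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            boolean escape = in[i] == '%' && i + 2 < in.length
                    && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0;
            if (escape) {
                out.write(hex(in[i + 1]) * 16 + hex(in[i + 2]));
                i += 2;
            } else {
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Value of an ASCII hex digit, or -1 if the byte is not one. */
    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    /** Comparison key only: decoding reserved characters such as %2F changes the URI itself. */
    static String normalize(String resource) {
        return percentDecode(unescapeNTriples(resource));
    }
}
```

With this, a URL-encoded name ending in Belgi%C3%AB and an escaped name ending in Belgi followed by the escape for U+00EB map to the same key, so the two data sets could be joined on it.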
From: Pablo N. Mendes [mailto:[email protected]]
Sent: Wednesday, September 19, 2012 5:08 PM
To: Max Jakob
Cc: Pedro Debevere; [email protected]; Dimitris
Kontokostas; [email protected]
Subject: Re: [Dbpedia-discussion] DBpedia Extraction Framework Dutch
disambiguation data set
Hi Pedro,
Whenever you get all the DBpedia datasets you need, it's time to run DBpedia
Spotlight indexing.
You will be glad to know that we've been working on a step-by-step guide to
build DBpedia Spotlight for other languages. Check this out:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization
You should also coordinate with Dimitris Kontokostas, who has so far been my
contact for the Dutch DBpedia and whom I have been including in the DBpedia
Spotlight i18n thread. Perhaps you can help each other out.
We hope to have a much improved (more automated) indexing process in the
next couple of days, so keep in touch.
Please join dbp-spotlight-users for questions about DBpedia Spotlight.
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
Cheers,
Pablo
On Wed, Sep 19, 2012 at 4:45 PM, Max Jakob <[email protected]> wrote:
Hi,
On Wed, Sep 19, 2012 at 3:46 PM, Pedro Debevere <[email protected]>
wrote:
> I'm interested in creating a Dutch port of DBpedia Spotlight. In order to do
> this, I need a disambiguation data set for Dutch. This data set is currently
> not available for download. However, based on some messages posted here [1],
> I suspect that the latest version of the extraction framework supports this.
> Is this correct?
Generally yes, if all names of disambiguation templates are specified
in [4]. Please also note that there seems to be an issue with multiple
names for disambiguation page titles in Dutch. See the TODO in [5].
> As a workaround I downloaded and unpacked the nl-pages-articles.xml file
> myself
On your first attempt, it looks like something went wrong during the
download. So downloading and unpacking the dump yourself was a good idea.
> Message: expected <mediawiki> with namespace
> [http://www.mediawiki.org/xml/export-0.6/], found
> [http://www.mediawiki.org/xml/export-0.7/]
Wikipedia seems to have changed its export format version from 0.6 to
0.7. The DBpedia parser should still be able to parse the dump,
assuming the changes mentioned in [6]. You can try to switch to the
dump branch (currently the stable one) and change the line in [7] to
    private final String _namespace = "http://www.mediawiki.org/xml/export-0.7/";
and try again. (Run mvn clean install in the project root first.)
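
If you need to parse both older and newer dumps, an alternative to hard-coding a single value would be a check that accepts both namespaces. The following is only a sketch of that idea. ExportNamespaceCheck is a hypothetical class and not the actual code in [7], and it assumes the parser really does handle 0.7 content unchanged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the real check lives in WikipediaDumpParser.java [7].
final class ExportNamespaceCheck {

    // Export namespaces mentioned in this thread (0.6 expected, 0.7 found).
    private static final Set<String> ACCEPTED = new HashSet<>(Arrays.asList(
            "http://www.mediawiki.org/xml/export-0.6/",
            "http://www.mediawiki.org/xml/export-0.7/"));

    /** True if the namespace on the <mediawiki> root element is one we accept. */
    static boolean isSupported(String namespace) {
        return ACCEPTED.contains(namespace);
    }

    /** Fails with a message similar to the one quoted above for any other namespace. */
    static void requireSupported(String namespace) {
        if (!isSupported(namespace)) {
            throw new IllegalArgumentException(
                    "expected <mediawiki> with namespace in " + ACCEPTED
                            + ", found [" + namespace + "]");
        }
    }
}
```

A set keeps the check explicit: a future export version still fails loudly instead of being parsed with possibly wrong assumptions.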
Cheers,
Max
[4] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Disambiguation.scala#l165
[5] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/config/mappings/DisambiguationExtractorConfig.scala#l16
[6] http://www.mediawiki.org/xml/export-0.7.xsd
[7] http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParser.java#l74
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users