Hi Pedro,
This is great feedback! Thanks a ton!
Comments inline:
On Fri, Sep 21, 2012 at 10:05 AM, Pedro Debevere <[email protected]> wrote:
> Hi,
>
> I’ve read the internationalization guide (which is very informative) and
> have the following additional remarks/questions:
>
> - I think that it can be beneficial to combine both the
> instance_type_$lang and instance_type_en data sets in order to obtain more
> instance type information. The instance_type_nl.nt data set for example
> currently contains only a small amount of instance type information (due to
> the small amount of mappings currently defined for the Dutch version of
> Wikipedia). This could be done, for example, by checking which resources in
> the canonical DBpedia data set have an owl:sameAs link to a localized
> version.
>
Yes. We thought about that. If you want to do it for Dutch, it is just a
matter of selecting the triples you want and concatenating them to the end
of your current NT file. We are currently working on a number of
cross-language features for a paper. We will report our results to this
list whenever we finish our tests.
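A minimal sketch of that selection step in Python, in case it helps. Everything in it is an assumption for illustration: the function name, the toy triples, and the direction of the owl:sameAs links (here from the localized resource to the English one). It also skips real N-Triples parsing and simply splits each line on whitespace, which is only safe for triples whose three terms are all URIs.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
OWL_SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"


def project_types(same_as_lines, en_type_lines):
    """Return rdf:type triples for localized resources, derived from the
    types of the English resources they are linked to via owl:sameAs."""
    # Map each English resource to the localized resources that point at it.
    en_to_local = {}
    for line in same_as_lines:
        parts = line.split()
        # Expected shape: <local> owl:sameAs <english> .
        if len(parts) == 4 and parts[1] == OWL_SAME_AS:
            en_to_local.setdefault(parts[2], []).append(parts[0])

    projected = []
    for line in en_type_lines:
        parts = line.split()
        # Expected shape: <english> rdf:type <class> .
        if len(parts) == 4 and parts[1] == RDF_TYPE:
            for local in en_to_local.get(parts[0], []):
                projected.append(f"{local} {RDF_TYPE} {parts[2]} .")
    return projected


# Toy input; real input would be read line by line from the .nt files.
same_as = [
    "<http://nl.dbpedia.org/resource/Gent> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ghent> .",
]
en_types = [
    "<http://dbpedia.org/resource/Ghent> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/City> .",
    "<http://dbpedia.org/resource/Berlin> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/City> .",
]
extra = project_types(same_as, en_types)
```

The lines in `extra` could then be appended to the localized instance types file, ideally after dropping any triples that file already contains.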
>
> - The .nt data sets of the canonicalized DBpedia use URIs
> (URL encoded), whereas the .nt files of the localized editions seem to use
> IRIs (Unicode escape characters). Could this lead to any problems during
> index generation?
>
Good question. :) We'd love to hear about your experience. Perhaps we could
have problems here:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/core/src/main/scala/org/dbpedia/spotlight/string/ModifiedWikiUtil.scala
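To make the concern concrete, here is a small Python sketch of how the two encodings of the same resource differ as raw strings and how they could be brought to one form before comparison. This is not what ModifiedWikiUtil does; the function names and the example resource are made up for illustration, and the sketch assumes percent-encoding of UTF-8 bytes on one side and N-Triples \uXXXX escapes on the other.

```python
import re
from urllib.parse import unquote

# Matches N-Triples style escapes: \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_nt_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they denote."""
    return _UNICODE_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text
    )


def normalize_resource(uri):
    """Bring a percent-encoded URI or an escaped IRI to one Unicode form."""
    return unquote(decode_nt_escapes(uri))


# Hypothetical example: the same resource name written in both styles.
uri_style = "http://dbpedia.org/resource/S%C3%A3o_Paulo"
iri_style = r"http://dbpedia.org/resource/S\u00E3o_Paulo"

same_raw = uri_style == iri_style
same_normalized = normalize_resource(uri_style) == normalize_resource(iri_style)
```

Here `same_raw` is False while `same_normalized` is True, so any step that joins the two kinds of files on the raw strings would silently miss matches. Note that unquoting also decodes percent signs that were meant literally, so this is only a starting point.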
>
> - The list of Lucene Analyzers on the Internationalization page seems to
> include only subclasses of org.apache.lucene.analysis.StopwordAnalyzerBase.
> Should this list be extended to include all subclasses of
> org.apache.lucene.analysis.Analyzer (e.g., DutchAnalyzer, which is a
> subclass of org.apache.lucene.analysis.ReusableAnalyzerBase), or are these
> the only usable analyzers in Spotlight?
>
AFAIK, any instance of Analyzer should work.
>
> - Can the amount of memory needed to generate the index be
> reduced by setting the minNumDocsBeforeFlush variable in
> IndexMergedOccurrences.scala to a smaller value?
>
I think so. We check whether minDocsBeforeFlush documents have accumulated,
and process them at that point [L129].
[L129]
https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/index/src/main/java/org/dbpedia/spotlight/lucene/index/MergedOccurrencesContextIndexer.java#L129
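The general buffer-and-flush pattern can be sketched in Python as follows. This is a toy, not the code in MergedOccurrencesContextIndexer: the class, its fields, and the run helper are hypothetical, and it counts buffered documents instead of measuring actual memory.

```python
from collections import defaultdict


class BufferedMerger:
    """Toy stand-in for an indexer that merges contexts in memory and
    writes them out once enough documents have been buffered."""

    def __init__(self, min_docs_before_flush):
        self.min_docs_before_flush = min_docs_before_flush
        self.buffer = defaultdict(list)  # resource URI -> buffered contexts
        self.docs_buffered = 0
        self.peak_docs_buffered = 0
        self.flushes = 0

    def add(self, uri, context):
        self.buffer[uri].append(context)
        self.docs_buffered += 1
        self.peak_docs_buffered = max(self.peak_docs_buffered, self.docs_buffered)
        # Flush as soon as the threshold is reached; this bounds the buffer size.
        if self.docs_buffered >= self.min_docs_before_flush:
            self.flush()

    def flush(self):
        if self.docs_buffered == 0:
            return
        # A real indexer would write the merged documents to disk here.
        self.buffer.clear()
        self.docs_buffered = 0
        self.flushes += 1


def run(threshold, n_docs=10):
    """Feed n_docs toy documents through the merger and report
    (peak number of buffered documents, number of flushes)."""
    merger = BufferedMerger(threshold)
    for i in range(n_docs):
        merger.add(f"uri{i % 3}", f"context {i}")
    merger.flush()  # write out whatever is left
    return merger.peak_docs_buffered, merger.flushes
```

In this toy, a smaller threshold lowers the peak number of buffered documents but increases the number of flushes, so the likely trade-off is less memory in exchange for more write and merge work.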
>
> - Maybe the remark posted here [1] (if this is still the case) is also
> relevant and should therefore also be mentioned on the
> Internationalization page.
>
Yes, I think so. Thanks for catching that!
Cheers,
Pablo
>
> [1]
> http://www.mail-archive.com/[email protected]/msg03005.html
>
> *From:* Pablo N. Mendes [mailto:[email protected]]
> *Sent:* Wednesday, September 19, 2012 5:08 PM
> *To:* Max Jakob
> *Cc:* Pedro Debevere; [email protected]; Dimitris
> Kontokostas; [email protected]
>
> *Subject:* Re: [Dbpedia-discussion] DBpedia Extraction Framework Dutch
> disambiguation data set
>
> Hi Pedro,
>
> Once you have all the DBpedia datasets you need, it's time to run
> DBpedia Spotlight indexing.
>
> You will be glad to know that we've been working on a step-by-step guide
> to build DBpedia Spotlight for other languages. Check this out:
>
>
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization
>
> You should also coordinate with Dimitris Kontokostas, who has so far been
> my contact for the Dutch DBpedia and whom I had been including in
> the DBpedia Spotlight i18n thread. Perhaps you can help each other out.
>
> We hope to have a much improved (more automated) indexing process in the
> next couple of days, so keep in touch.
>
> Please join dbp-spotlight-users for questions about DBpedia Spotlight.
> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
>
> Cheers,
> Pablo
>
> On Wed, Sep 19, 2012 at 4:45 PM, Max Jakob <[email protected]> wrote:
>
> Hi,
>
>
> On Wed, Sep 19, 2012 at 3:46 PM, Pedro Debevere <[email protected]> wrote:
> > I’m interested in creating a Dutch port of DBpedia Spotlight. In order to do
> > this, I need a disambiguation data set for Dutch. This data set is currently
> > not available for download. However, based on some messages posted here [1],
> > I suspect that the latest version of the extraction framework supports this.
> > Is this correct?
>
> Generally yes, if all names of disambiguation templates are specified
> in [4]. Please also note that there seems to be an issue with multiple
> names for disambiguation page titles in Dutch. See the TODO in [5].
>
>
>
> > As a workaround I downloaded and unpacked the nl-pages-articles.xml file
> > myself
>
> On your first attempt, it looks like something went wrong during
> download. So downloading and unpacking yourself was a good idea.
>
>
>
> > Message: expected <mediawiki> with namespace
> > [http://www.mediawiki.org/xml/export-0.6/], found
> > [http://www.mediawiki.org/xml/export-0.7/]
>
> Wikipedia seems to have changed its export format version from 0.6 to
> 0.7. The DBpedia parser should still be able to parse the dump,
> assuming the changes mentioned in [6]. You can try to switch to the
> dump branch (currently the stable one) and change the line in [7] to
>
>     private final String _namespace = "http://www.mediawiki.org/xml/export-0.7/";
>
> and try again. (Run mvn clean install in the project root first.)
>
>
> Cheers,
> Max
>
> [4]
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/wikipedia/Disambiguation.scala#l165
> [5]
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/scala/org/dbpedia/extraction/config/mappings/DisambiguationExtractorConfig.scala#l16
> [6] http://www.mediawiki.org/xml/export-0.7.xsd
> [7]
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2a322c5c6692/core/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParser.java#l74
>
>
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
> --
> ---
> Pablo N. Mendes
> http://pablomendes.com
> Events: http://wole2012.eurecom.fr
>
>
>
>
> _______________________________________________
> Dbp-spotlight-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
>
>
--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr