[
https://issues.apache.org/jira/browse/STANBOL-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler resolved STANBOL-223.
-----------------------------------------
Resolution: Won't Fix
Won't fix. See STANBOL-1037 for further development on disambiguation.
> Entity Disambiguation
> ---------------------
>
> Key: STANBOL-223
> URL: https://issues.apache.org/jira/browse/STANBOL-223
> Project: Stanbol
> Issue Type: New Feature
> Components: Enhancement Engines
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
> Labels: gsoc2012
>
> Adding Disambiguation support to the Stanbol Enhancer includes the following
> points:
> 1. Dataset: For Disambiguation you need not only a set of Entities but also
> additional data used for the disambiguation
> * This might need some preprocessing of the data (e.g. using mentions of
> the entity in sentences; Using data from linked Entities to create a context)
> * This data needs to be accessible to the Stanbol Enhancer (e.g. by using the
> Entityhub, a dedicated SolrIndex or even other means)
> 2. Deciding on possible algorithms
> * This issue already describes two possible algorithms (see below and comments)
> 3. Workflow:
> a) Disambiguate while linking (basically you have the String "Paris" and
> the Sentence/Document as context and want to know if you
> should link to Paris, France or Paris, Texas)
> b) Disambiguate already linked Entities (you have 5 suggested Entities by
> two different Engines and you want to disambiguate (rank)
> them)
> 4. Validation of the Disambiguation: We need to compare enhancement quality
> with/without disambiguation
> * The Benchmarking (enhancer/benchmark) tool could be used for that
> * Question: How much time would be needed to create Benchmarking Examples?
> 5. What are the expected results?
> * implementation of one (maybe more) disambiguation algorithm(s)
> * integration to the Stanbol Enhancer as one or more EnhancementEngines
> * management of the data needed for disambiguation (e.g. as part of the
> Entityhub)
> * support (tools) for creating/extracting data needed for disambiguation
> * Validation results using the enhancer/benchmarking tool
> * Documentation on the Stanbol Webpage
> * Simple Web interface showing the improved enhancement results (I am
> thinking of a single text box to put the text and two enhancement results, one
> with and one without entity disambiguation).
> Optional
> * integration of user feedback to enhance learning/validation set
> Disambiguation based on Solr MLT
> ================================
> The idea is to use sentences with links to an Entity in a dataset (e.g.
> Wikipedia) as context and compare this with the text surrounding an Entity
> extracted from the analyzed content. Solr More Like This (MLT) queries will
> be used for the ranking.
>
> More details:
> Sentences with occurrences of the Entity can be extracted by using
> https://github.com/ogrisel/pignlproc. Functionality will be added to output
> the results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization).
> This will allow this information to be indexed (together with all the other
> information about Entities) by using the Indexing Tools provided by the Stanbol
> Entityhub (e.g. entityhub/indexing/dbpedia).
> The following Information will be used for EntityDisambiguation:
> (1) TextAnnotations providing the label, the type as detected by the NLP
> framework, the context of the extraction
> (1b) In addition links to other Text Annotations about the same Entity could
> be used to extend the context
> (2) A Solr Index (ReferencedSite of the Stanbol Entityhub) providing at least
> the labels, types and the occurrences of the Entities
> EntityDisambiguation will filter based on the label and the type (filter
> query) and rank selected Entities based on a "More Like This" query with the
> context over the occurrences.
> A first prototype of this engine was implemented during the bbuzz - Semantic
> Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a separate
> EnhancementEngine that uses a separate Solr Index for the MLT queries.
> The plan is to implement this as an optional (configurable) feature of the
> existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to
> activate/deactivate Entity disambiguation via the OSGi Console if the
> required data is available for a ReferencedSite.
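
The filter-then-rank step described in the quoted issue (filter on label and
type, then rank by similarity between the mention context and the stored
occurrences) can be illustrated with a small stand-alone sketch. This is not
the Stanbol engine and does not call Solr: a plain bag-of-words cosine
similarity stands in for the "More Like This" ranking, and the record layout
(id, labels, types, occurrences), the function names and the sample entities
are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(label, entity_type, context, entities):
    """Filter entities by label and type, then rank them by context similarity.

    Returns a list of (entity id, score) pairs, best match first.
    The cosine score is a stand-in for a Solr MLT score (assumption).
    """
    wanted = label.lower()
    # Filter step: plays the role of the filter query on label and type.
    candidates = [
        e for e in entities
        if wanted in (l.lower() for l in e["labels"]) and entity_type in e["types"]
    ]
    # Rank step: compare the mention context with the stored occurrences.
    ctx = Counter(tokenize(context))
    scored = [
        (e["id"], cosine(ctx, Counter(tokenize(" ".join(e["occurrences"])))))
        for e in candidates
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data; a real index would hold extracted sentences.
ENTITIES = [
    {"id": "Paris_France", "labels": ["Paris"], "types": ["Place"],
     "occurrences": ["Paris is the capital of France.",
                     "The Louvre museum is in Paris."]},
    {"id": "Paris_Texas", "labels": ["Paris"], "types": ["Place"],
     "occurrences": ["Paris is a city in Lamar County, Texas."]},
    {"id": "Paris_Hilton", "labels": ["Paris Hilton", "Paris"], "types": ["Person"],
     "occurrences": ["Paris Hilton is an American media personality."]},
]

ranking = disambiguate("Paris", "Place",
                       "We visited the Louvre in Paris, the capital of France.",
                       ENTITIES)
```

With this sample data, the Person entity is dropped by the filter step and the
two Place candidates are ordered by how much their occurrence sentences
overlap with the mention context.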
--
This message was sent by Atlassian JIRA
(v6.1#6144)