[ 
https://issues.apache.org/jira/browse/STANBOL-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler resolved STANBOL-223.
-----------------------------------------

    Resolution: Won't Fix

Won't fix. See STANBOL-1037 for further development on disambiguation.

> Entity Disambiguation
> ---------------------
>
>                 Key: STANBOL-223
>                 URL: https://issues.apache.org/jira/browse/STANBOL-223
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancement Engines
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>              Labels: gsoc2012
>
> Adding Disambiguation support to the Stanbol Enhancer includes the following 
> points
> 1. Dataset: For Disambiguation you need not only a set of Entities but also 
> additional data used for the disambiguation
>   * This might need some preprocessing of the data (e.g. using mentions of 
> the entity in sentences; Using data from linked Entities to create a context)
>   * This data needs to be accessible to the Stanbol Enhancer (e.g. by using the
> Entityhub, a dedicated SolrIndex or other means)
> 2. Deciding on possible algorithms
>   * This issue already contains two possible algorithms (see below and comments)
> 3. Workflow:
>   a) Disambiguate while linking (basically you have the String "Paris" and 
> the Sentence/Document as context and want to know if you
> should link to Paris, France or Paris, Texas)
>   b) Disambiguate already linked Entities (you have 5 suggested Entities by 
> two different Engines and you want to disambiguate (rank)
> them)
> 4. Validation of the Disambiguation: We need to compare enhancement quality 
> with/without disambiguation
>   * The Benchmarking (enhancer/benchmark) tool could be used for that
>   * Question: How much time would be needed to create Benchmarking Examples?
> 5. What are the expected results?
>   * implementation of one (maybe more) disambiguation algorithm(s)
>   * integration to the Stanbol Enhancer as one or more EnhancementEngines
>   * management of the data needed for disambiguation (e.g. as part of the 
> Entityhub)
>   * support (tools) for creating/extracting data needed for disambiguation
>   * Validation results using the enhancer/benchmarking tool
>   * Documentation on the Stanbol Webpage
>   * Simple Web interface showing the improved enhancement results (I am
> thinking of a single text box to put the text and two enhancement results, one
> with and one without entity disambiguation).
> Optional:
>   * integration of user feedback to enhance learning/validation set
>
> Disambiguation based on Solr MLT
> ================================
> The idea is to use sentences with links to an Entity in a dataset (e.g.
> Wikipedia) as context and compare them with the surrounding text of an Entity
> extracted from the analyzed content. Solr More Like This (MLT) queries will
> be used for the ranking.
>  
> More details:
> Sentences with occurrences of the Entity can be extracted by using
> https://github.com/ogrisel/pignlproc. Functionality will be added to output
> the results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization).
> This will allow this information to be indexed (together with all the other
> information of Entities) by using the Indexing Tools provided by the Stanbol
> Entityhub (e.g. entityhub/indexing/dbpedia).
> The following Information will be used for EntityDisambiguation:
> (1) TextAnnotations providing the label, the type as detected by the NLP 
> framework, the context of the extraction
> (1b) In addition links to other Text Annotations about the same Entity could 
> be used to extend the context
> (2) A Solr Index (ReferencedSite of the Stanbol Entityhub) providing at least 
> the labels, types and the occurrences of the Entities
> EntityDisambiguation will filter based on the label and the type (filter 
> query) and rank selected Entities based on a "More Like This" query with the 
> context over the occurrences.
> A first prototype of this engine was implemented during the bbuzz Semantic
> Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a
> standalone EnhancementEngine that uses a separate Solr index for the MLT queries.
> The plan is to implement this as an optional (configurable) feature of the
> existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to
> activate/deactivate Entity disambiguation via the OSGi Console if the
> required data are available for a ReferencedSite.
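
For illustration only, the filter-then-rank step described in the quoted issue (filter candidates by label and type, then rank them by how similar the mention's context is to the stored occurrence sentences) might look roughly like the sketch below. It does not use Solr or any Stanbol API: a plain bag-of-words cosine similarity stands in for the "More Like This" query, and the toy index, the field names and the `disambiguate` function are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy index. In Stanbol this data would live in a Solr index
# (a ReferencedSite of the Entityhub); ids and field names are assumptions.
ENTITIES = [
    {"id": "dbpedia:Paris", "label": "Paris", "type": "Place",
     "occurrences": ["Paris is the capital of France and lies on the Seine."]},
    {"id": "dbpedia:Paris,_Texas", "label": "Paris", "type": "Place",
     "occurrences": ["Paris is a city in Lamar County, Texas, United States."]},
    {"id": "dbpedia:Paris_Hilton", "label": "Paris", "type": "Person",
     "occurrences": ["Paris Hilton is an American media personality."]},
]


def tokens(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(label, entity_type, context, entities=ENTITIES):
    """Return candidate entity ids, best match first."""
    # Step 1: filter by label and type (the "filter query" in the issue).
    candidates = [e for e in entities
                  if e["label"] == label and e["type"] == entity_type]
    # Step 2: rank by similarity between the mention context and the
    # occurrence sentences (a simple stand-in for a Solr MLT query).
    ctx = Counter(tokens(context))
    scored = [(cosine(ctx, Counter(tokens(" ".join(e["occurrences"])))), e["id"])
              for e in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entity_id for _, entity_id in scored]


if __name__ == "__main__":
    print(disambiguate("Paris", "Place",
                       "The mayor of Paris spoke about tourism in France along the Seine."))
```

A real MLT query would also weight terms by their rarity in the index; the sketch ignores that and only shows the overall shape of the two steps.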



--
This message was sent by Atlassian JIRA
(v6.1#6144)
