[ 
https://issues.apache.org/jira/browse/STANBOL-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafa Haro reassigned STANBOL-1037:
----------------------------------

    Assignee: Rafa Haro

> Entity Disambiguation for Stanbol
> ---------------------------------
>
>                 Key: STANBOL-1037
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1037
>             Project: Stanbol
>          Issue Type: Story
>          Components: Enhancer, Entityhub
>            Reporter: Rafa Haro
>            Assignee: Rafa Haro
>            Priority: Major
>              Labels: gsoc2013, mentoring
>         Attachments: stanbol-enhancement-workflow.001.png
>
>
> Entity Disambiguation in Stanbol mainly refers to the process of 
> modifying the fise:confidence values of EntityAnnotations obtained as a 
> result of any Linking Engine within Stanbol (EntityLinkingEngine or 
> NamedEntityLinking). Such modifications to confidence values should be done 
> in order to obtain a ranking of possible candidates (entities) to link with 
> for each EntityAnnotation after a disambiguation process. Each candidate 
> would be an Entity within EntityHub or any other Knowledge Base configured in 
> Stanbol.
> Disambiguation
> ============
> Entity Linking is not a trivial task due to the name ambiguity problem, i.e., 
> the same name may refer to different entities in different contexts and also 
> the same entity can usually be mentioned using a set of different names. For 
> instance, the name Michael Jordan can refer to more than 20 entities in 
> Wikipedia, some of which are shown below:
>     - Michael Jordan (NBA Player)
>     - Michael I. Jordan (Berkeley Professor)
>     - Michael B. Jordan (American Actor)
> This situation arises not only with well-known semantic knowledge bases 
> like DBpedia or Freebase, but also with any enterprise semantic dataset or 
> custom vocabulary. A simple example is resolving ambiguous names within a 
> database of employees.
> Formally, Entity Disambiguation for Stanbol should work as follows: after an 
> enhancement process of a ContentItem using an enhancement chain that includes 
> a Linking Engine, we would get a set of TextAnnotations TA = {T1, T2, ..., 
> Tn}. Each TextAnnotation in TA should contain a name mention, which is 
> characterized by its name, its local surrounding context 
> (fise:selection-context) and the ContentItem containing it. For each 
> TextAnnotation Ti in TA, and as a result of the Linking Engine, we would get 
> a set of EntityAnnotations EAi = {E1i, E2i, ..., ENi}. We should rely on the 
> linking engines to provide all possible entity annotations (candidates 
> within all sites in the EntityHub) for each TextAnnotation. Each 
> EntityAnnotation is characterized by its Knowledge Base (entityhub:site) and 
> its entry in that knowledge base (fise:entity-reference). The objective of 
> the disambiguation process is to rank each EntityAnnotation set EAi by 
> modifying the confidence values of its EntityAnnotations, so that the entity 
> with the highest confidence value is the referent entity for the 
> TextAnnotation associated with EAi.
> Algorithms
> ========
> ** Local Approaches
> (From [1]) Conventional entity linking approaches have focused on making 
> independent Entity Linking decisions using the local mention-to-entity 
> compatibility for each isolated mention. The essential idea was to extract 
> the discriminative features from the description of a specific entity and 
> then link each name mention in a document by comparing the contextual 
> similarity with each of its candidate referent entities. This approach is 
> followed by the Disambiguation-MLT engine in STANBOL-723.
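As a rough illustration of the local approach, the sketch below compares the context of a mention with a textual description of each candidate, using plain bag-of-words cosine similarity. The function names and the shape of the `descriptions` dictionary are assumptions made for this example; it is not meant to reproduce what the Disambiguation-MLT engine does.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a deliberately naive tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_locally(mention_context, descriptions):
    """descriptions: candidate entity -> descriptive text (e.g. an abstract).

    Each mention is scored in isolation, which is what makes this 'local'.
    """
    ctx = bag_of_words(mention_context)
    scored = [(entity, cosine(ctx, bag_of_words(text)))
              for entity, text in descriptions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, calling `rank_locally` with a context that mentions machine learning would place a candidate whose description mentions machine learning ahead of one that only mentions basketball.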
> ** Global Approaches (Collective Entity Linking)
> The main drawback of the local-based approaches stems from the fact that they 
> do not take into consideration the interdependence between different Entity 
> Linking decisions. Specifically, the entities in a topically coherent document 
> usually are semantically related to each other. In such cases, figuring out 
> the referent entity of one name mention may in turn give us useful 
> information to link the other name mentions in the same document. That 
> suggests that disambiguation performance could be improved by resolving all 
> mentions at the same time.
> This approach only makes sense in a scenario with highly connected knowledge 
> bases where the entities are semantically related in some way.
> ** Graph Based Approaches
> In these approaches, both the Knowledge Base and the interdependence between 
> possible Entity Linking decisions are modeled as graphs, and inference 
> algorithms are used to resolve all the mentions within a document.
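To make the global idea concrete, here is a minimal sketch that resolves all mentions of a document at once by maximizing the sum of local scores plus a coherence bonus for every pair of chosen entities that are related in the knowledge base. The relatedness graph is represented as a set of undirected edges. The names, the unit coherence bonus and the exhaustive search are assumptions of this example; the search is exponential in the number of mentions, so a real engine would need a proper inference algorithm.

```python
from itertools import combinations, product


def collective_link(candidates, related):
    """Jointly pick one entity per mention.

    candidates: mention -> {entity: local_score}
    related:    set of frozenset({entity_a, entity_b}) edges in the knowledge base
    """
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Enumerate every joint assignment (only viable for small inputs).
    for choice in product(*(candidates[m] for m in mentions)):
        local = sum(candidates[m][e] for m, e in zip(mentions, choice))
        coherence = sum(
            1.0 for a, b in combinations(choice, 2) if frozenset((a, b)) in related
        )
        if local + coherence > best_score:
            best, best_score = dict(zip(mentions, choice)), local + coherence
    return best
```

In a document that mentions both "Michael Jordan" and "Chicago Bulls", the coherence term can override a higher local score for the Berkeley professor, because only the NBA player is connected to the team in the graph.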
> Knowledge Bases
> ==============
> As described in STANBOL-223, disambiguation needs some data to use as 
> disambiguation features. The nature of this data will depend on the 
> particularities of the knowledge base. In general, it will be necessary to 
> generate a semantic context for each candidate and process it in the 
> disambiguation algorithm. The Disambiguation Context could be a fixed data 
> structure for each kind of disambiguation engine in Stanbol, and developers 
> should be in charge of developing mechanisms to create those contexts for 
> their custom vocabularies or knowledge bases.
> For instance, with Local Approaches, developers should be able to configure 
> Disambiguation-MLT or any other local disambiguation engine to obtain a 
> disambiguation context from EntityHub and compute its similarity with the 
> mentions' contexts within the Content Item.
> This can be as easy as selecting an Entity's disambiguation fields or as 
> complex as calling methods that build disambiguation contexts on the fly. 
> Normally, the first option will involve generating disambiguation fields at 
> EntityHub index creation time. For instance, as described in STANBOL-223, 
> for DBpedia it is possible to extract sentences containing mentions of 
> entities from Wikipedia using https://github.com/ogrisel/pignlproc. These 
> sentences can be included in the DBpedia EntityHub index as disambiguation 
> fields. Entities' abstracts can also be used for disambiguation. All these 
> fields should be configurable (boost) for disambiguation purposes.
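One way to read "configurable (boost)" is as a per-field weight applied when the mention context is compared with each disambiguation field. The sketch below assumes that reading; the field names, the Jaccard token overlap and the `boosted_score` function are all hypothetical and only illustrate how boosts could combine several fields into one score.

```python
import re


def tokens(text):
    """Set of lowercased tokens; a deliberately naive tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def boosted_score(mention_context, entity_fields, boosts):
    """Weighted sum of per-field similarities.

    entity_fields: field name -> text stored for the entity (assumed layout)
    boosts:        field name -> weight; fields without a boost are ignored
    """
    ctx = tokens(mention_context)
    return sum(
        boost * jaccard(ctx, tokens(entity_fields.get(name, "")))
        for name, boost in boosts.items()
    )
```

With boosts such as `{"abstract": 2.0, "sentences": 1.0}`, a match in the abstract counts twice as much as the same match in the extracted sentences.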
> General Architecture and Workflow
> ==========================
> A typical Disambiguation system architecture would include three steps: 
> ** Candidate Generation: from a surface form (name mention) in the Content 
> Item, generate a set of possible entities within EntityHub to link with. A 
> typical source of entity names is the entities' labels, but other fields can 
> be used. In this step, it is necessary to decide how to search those name 
> sources: Exact Matching, Overlapping, Fuzzy Search, Full-Text Search, 
> Case-sensitive, Coreference Resolution...
> ** Candidate Ranking: rank all candidates by their probability of being the 
> referent entity. Basically, this step involves executing the specific 
> disambiguation engine as an enhancement post-processing phase.
> ** Detect and Cluster Missing Entities: mentions that should not actually be 
> linked to any Entity should be extracted and grouped in clusters (one 
> cluster for each unknown entity). These entities can be suggested to the 
> user for inclusion in the knowledge base (Automatic Knowledge Base 
> Population).
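The three steps above can be sketched end to end as follows. Everything here is an assumption made for illustration: candidate generation uses only case-insensitive exact matching on labels, ranking delegates to a caller-supplied `score` function, and missing entities are clustered naively by normalized surface form with a fixed confidence threshold.

```python
from collections import defaultdict


def generate_candidates(mention, labels):
    """Step 1. labels: entity -> list of labels; case-insensitive exact match."""
    needle = mention.strip().lower()
    return [e for e, names in labels.items()
            if needle in (n.lower() for n in names)]


def link_mentions(mentions, labels, score, threshold=0.5):
    """Steps 2 and 3: rank candidates, then cluster mentions left unlinked."""
    linked, missing = {}, defaultdict(list)
    for position, mention in enumerate(mentions):
        # Step 2: rank the candidates of this mention, best first.
        ranked = sorted(generate_candidates(mention, labels),
                        key=lambda e: score(mention, e), reverse=True)
        if ranked and score(mention, ranked[0]) >= threshold:
            linked[position] = ranked[0]
        else:
            # Step 3: one cluster per normalized surface form.
            missing[mention.strip().lower()].append(position)
    return linked, dict(missing)
```

The clusters returned in `missing` are the entries that could be suggested to the user as new knowledge base entities.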



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
