[jira] [Updated] (STANBOL-1037) Entity Disambiguation for Stanbol

Rafa Haro (JIRA) Mon, 22 Apr 2013 05:19:20 -0700

     [ 
https://issues.apache.org/jira/browse/STANBOL-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rafa Haro updated STANBOL-1037:
-------------------------------

    Description: 
Entity Disambiguation in Stanbol mainly refers to the process of modifying 
the fise:confidence values of EntityAnnotations obtained as a result of any 
Linking Engine within Stanbol (EntityLinkingEngine or NamedEntityLinking). 
These confidence values should be modified so that, after the disambiguation 
process, the possible candidates (entities) to link with are ranked for each 
TextAnnotation. Each candidate would be an Entity within EntityHub or any 
other Knowledge Base configured in Stanbol.

Disambiguation
============

Entity Linking is not a trivial task because of the name ambiguity problem: 
the same name may refer to different entities in different contexts, and the 
same entity can usually be mentioned using a set of different names. For 
instance, the name Michael Jordan can refer to more than 20 entities in 
Wikipedia, some of which are shown below:

    - Michael Jordan (NBA Player)
    - Michael I. Jordan (Berkeley Professor)
    - Michael B. Jordan (American Actor)

This situation arises not only with well-known semantic knowledge bases like 
DBpedia or Freebase, but also with any enterprise semantic dataset or custom 
vocabulary. An immediate example is resolving ambiguity within a database of 
employees.

Formally, Entity Disambiguation for Stanbol should work as follows: after an 
enhancement process of a ContentItem using an enhancement chain that includes a 
Linking Engine, we would get a set of TextAnnotations TA = {T1, T2, ..., Tn}. 
Each TextAnnotation in TA should contain a name mention, which is characterized 
by its name, its local surrounding context (fise:selection-context) and the 
ContentItem containing it. For each TextAnnotation in TA, and as a result of 
the Linking Engine, we would get a set of EntityAnnotations EAi = {E1i, E2i, 
..., ENi}, where i corresponds to TextAnnotation i in TA. We should rely on the 
linking engines to provide all possible entity annotations (candidates within 
all sites in the EntityHub) for each TextAnnotation. Each EntityAnnotation is 
characterized by its Knowledge Base (entityhub:site) and its entry in that 
knowledge base (fise:entity-reference). The objective of the disambiguation 
process is to rank each EntityAnnotation set EAi by modifying the confidence 
values of its EntityAnnotations, so that the entity with the highest 
confidence value is the referent entity for the TextAnnotation associated with 
EAi.
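
The ranking goal described above can be pictured with a minimal Python 
sketch. The class and function names (EntityAnnotation, TextAnnotation, 
rerank) are hypothetical and only mirror the vocabulary of this description; 
they are not Stanbol's API. The scoring function is supplied by the caller, 
and normalising the confidences of one TextAnnotation to sum to 1 is an 
assumption of this sketch, not a requirement stated in the issue.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EntityAnnotation:
    entity_reference: str   # mirrors fise:entity-reference
    site: str               # mirrors entityhub:site
    confidence: float       # mirrors fise:confidence


@dataclass
class TextAnnotation:
    selected_text: str      # the name mention
    selection_context: str  # mirrors fise:selection-context
    candidates: List[EntityAnnotation] = field(default_factory=list)


def rerank(ta: TextAnnotation,
           score: Callable[[TextAnnotation, EntityAnnotation], float]) -> None:
    """Overwrite each candidate's confidence with a disambiguation score,
    normalised so the confidences of one TextAnnotation sum to 1 (an
    assumption of this sketch), then sort from most to least likely."""
    raw = [max(score(ta, ea), 0.0) for ea in ta.candidates]
    total = sum(raw)
    for ea, s in zip(ta.candidates, raw):
        ea.confidence = s / total if total > 0 else 0.0
    ta.candidates.sort(key=lambda ea: ea.confidence, reverse=True)
```

After rerank, the first element of candidates plays the role of the referent 
entity for that TextAnnotation.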


Algorithms
========

** Local Approaches

(From [1]) Conventional entity linking approaches have focused on making 
independent Entity Linking decisions using the local mention-to-entity 
compatibility for each isolated mention. The essential idea is to extract 
discriminative features from the description of a specific entity and then 
link each name mention in a document by comparing its contextual similarity 
with each of its candidate referent entities. Such an approach is followed by 
the Disambiguation-MLT engine in STANBOL-723.
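
As an illustration of the local idea, the following Python sketch scores each 
candidate independently by the cosine similarity between the mention's context 
and the candidate's description. It is not the implementation of the 
Disambiguation-MLT engine; the bag-of-words features and all function names 
are assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a stand-in for real feature extraction."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_locally(mention_context, candidate_descriptions):
    """Score each candidate on its own against the mention context.

    candidate_descriptions: dict mapping entity id -> description text.
    Returns (entity id, score) pairs, best first.
    """
    ctx = bag_of_words(mention_context)
    scores = {entity: cosine(ctx, bag_of_words(desc))
              for entity, desc in candidate_descriptions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Each mention is resolved in isolation here, which is exactly the limitation 
that collective approaches try to address.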


** Global Approaches (Collective Entity Linking)

The main drawback of local approaches is that they do not take into 
consideration the interdependence between different Entity Linking decisions. 
Specifically, the entities in a topically coherent document are usually 
semantically related to each other. In such cases, figuring out the referent 
entity of one name mention may in turn give us useful information for linking 
the other name mentions in the same document. This suggests that 
disambiguation performance could be improved by resolving all mentions at the 
same time.

This approach only makes sense in a scenario with highly connected knowledge 
bases where the entities are semantically related in some way.
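
A minimal sketch of collective linking, under stated assumptions: the local 
scores and the pairwise relatedness values are given as plain dictionaries 
(how they are computed is out of scope here), and the joint assignment is 
found by exhaustive search, which is only viable for a handful of mentions. 
The function name and data layout are hypothetical.

```python
from itertools import product


def link_collectively(candidates, local_score, relatedness):
    """Pick one entity per mention so that the sum of local scores plus the
    pairwise relatedness between the chosen entities is maximal.

    candidates:  dict mention -> list of entity ids
    local_score: dict (mention, entity id) -> float
    relatedness: dict frozenset({entity a, entity b}) -> float
    """
    mentions = list(candidates)
    best, best_value = None, float("-inf")
    for choice in product(*(candidates[m] for m in mentions)):
        value = sum(local_score[(m, e)] for m, e in zip(mentions, choice))
        value += sum(relatedness.get(frozenset((a, b)), 0.0)
                     for i, a in enumerate(choice)
                     for b in choice[i + 1:])
        if value > best_value:
            best, best_value = dict(zip(mentions, choice)), value
    return best
```

With a relatedness entry connecting a basketball player to a basketball team, 
a mention of the team can pull an ambiguous "Jordan" towards the player even 
when the local score alone prefers another candidate.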

** Graph Based Approaches

In these approaches, both the Knowledge Base and the interdependence between 
possible Entity Linking decisions are modeled as graphs, and inference 
algorithms are used to resolve all the mentions within a document.
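
One possible inference algorithm, shown purely as an illustration, is a random 
walk with restart over a graph whose nodes are candidate entities and whose 
edges are relations in the knowledge base. The issue does not prescribe this 
algorithm; the function name, the damping value and the graph encoding are 
assumptions. Every node that appears in edges must also have a positive entry 
in prior.

```python
def propagate(edges, prior, damping=0.85, iterations=50):
    """Random walk with restart over an undirected entity graph.

    edges: dict node -> set of neighbouring nodes (symmetric)
    prior: dict node -> positive initial score, used as the restart vector
    Returns a dict node -> score, normalised to sum to 1.
    """
    nodes = list(prior)
    total = sum(prior.values())
    restart = {n: prior[n] / total for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {}
        for n in nodes:
            # Each neighbour shares its score equally among its own edges.
            incoming = sum(score[m] / len(edges[m])
                           for m in edges.get(n, ()) if edges.get(m))
            nxt[n] = (1 - damping) * restart[n] + damping * incoming
        # Nodes without neighbours leak mass, so renormalise every round.
        norm = sum(nxt.values())
        score = {n: v / norm for n, v in nxt.items()}
    return score
```

Candidates that are well connected to the other candidates in the document end 
up with higher scores than isolated ones, which can then be written back as 
confidence values.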


Knowledge Bases
==============

As described in STANBOL-223, disambiguation needs some data to use as 
disambiguation features. The nature of this data will depend on the 
particularities of the knowledge base. In general, it will be necessary to 
generate a semantic context for each candidate and process it in the 
disambiguation algorithm. The Disambiguation Context could be a fixed data 
structure for each kind of disambiguation engine in Stanbol, and developers 
should be in charge of developing mechanisms to create those contexts for 
their custom vocabularies or knowledge bases.

For instance, with Local Approaches, developers should be able to configure 
Disambiguation-MLT or any other local disambiguation engine to obtain a 
disambiguation context from EntityHub and compute its similarity with the 
mentions' contexts within the Content Item.

This can be as easy as selecting an Entity's disambiguation fields or as 
complex as calling methods that build disambiguation contexts on the fly. 
Normally, the first option will involve generating disambiguation fields at 
EntityHub index creation time. For instance, as described in STANBOL-223, for 
DBpedia it is possible to extract sentences containing mentions of entities 
from Wikipedia using https://github.com/ogrisel/pignlproc. These sentences can 
be included in the DBpedia EntityHub index as disambiguation fields. Entities' 
abstracts can also be used for disambiguation. All these fields should be 
configurable (boost) for disambiguation purposes.
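
The idea of configurable, boosted disambiguation fields can be sketched as 
follows. The field names used in the usage note (abstract, sentences) and the 
function name are hypothetical; an actual EntityHub index would define its own 
fields, and the whitespace tokenisation is a simplification.

```python
def build_context(entity, boosts):
    """Merge an entity's disambiguation fields into weighted terms.

    entity: dict field name -> text
    boosts: dict field name -> weight; fields not listed are ignored
    Returns a dict term -> accumulated weight.
    """
    context = {}
    for field_name, boost in boosts.items():
        for term in entity.get(field_name, "").lower().split():
            context[term] = context.get(term, 0.0) + boost
    return context
```

For example, build_context(entity, {"abstract": 2.0, "sentences": 1.0}) would 
count every term of the abstract twice as much as a term from the extracted 
sentences, and would ignore all other fields of the entity.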


General Architecture and Workflow
==========================

A typical Disambiguation system architecture would include three steps: 

** Candidate Generation: from a surface form (name mention) in the Content 
Item, generate a set of possible entities within EntityHub to link with. A 
typical source of entity names is the entities' labels, but other fields can 
be used. In this step, it is necessary to decide how to search those name 
sources: Exact Matching, Overlapping, Fuzzy Search, Full-Text Search, 
Case-sensitive, Coreference Resolution...

** Candidate Ranking: rank all candidates by their probability of being the 
referent entity. Basically, this step involves executing the specific 
disambiguation engine as an enhancement post-processing phase.

** Detect and Cluster Missing Entities: those mentions that actually shouldn't 
be linked to any Entity should be extracted and grouped in clusters (one 
cluster for each unknown entity). These entities can be suggested to the user 
in order to include them in the knowledge base (Automatic Knowledge Base 
Population).
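
The three steps above can be sketched end to end in a few lines of Python. 
This is a simplified illustration under several assumptions: candidate 
generation uses case-insensitive exact matching on labels only, ranking 
delegates to a caller-supplied score function, and missing entities are 
detected with a fixed threshold and clustered by normalised surface form. All 
names and the threshold value are hypothetical.

```python
from collections import defaultdict


def generate_candidates(mention, labels):
    """Step 1: case-insensitive exact match of the mention against labels.
    labels: dict entity id -> list of labels."""
    wanted = mention.casefold()
    return [e for e, names in labels.items()
            if any(wanted == n.casefold() for n in names)]


def rank_candidates(mention, candidates, score):
    """Step 2: order candidates by a caller-supplied score function."""
    return sorted(((e, score(mention, e)) for e in candidates),
                  key=lambda pair: pair[1], reverse=True)


def disambiguate(mentions, labels, score, threshold=0.5):
    """Run the three steps; returns (links, clusters of unlinked mentions)."""
    links, missing = {}, defaultdict(list)
    for mention in mentions:
        ranked = rank_candidates(
            mention, generate_candidates(mention, labels), score)
        if ranked and ranked[0][1] >= threshold:
            links[mention] = ranked[0][0]
        else:
            # Step 3: group unlinked mentions by normalised surface form.
            missing[mention.casefold()].append(mention)
    return links, dict(missing)
```

Each cluster in the second return value stands for one unknown entity that 
could be suggested to the user for inclusion in the knowledge base.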

> Entity Disambiguation for Stanbol
> ---------------------------------
>
>                 Key: STANBOL-1037
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1037
>             Project: Stanbol
>          Issue Type: Story
>          Components: Enhancer, Entityhub
>            Reporter: Rafa Haro
>              Labels: gsoc2013, mentoring
>         Attachments: stanbol-enhancement-workflow.001.png

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
