[ https://issues.apache.org/jira/browse/STANBOL-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rafa Haro updated STANBOL-1037:
-------------------------------
Description:
Entity Disambiguation in Stanbol mainly refers to the process of modifying the
fise:confidence values of the EntityAnnotations produced by any Linking Engine
within Stanbol (EntityLinkingEngine or NamedEntityLinking). These
modifications should yield, after a disambiguation process, a ranking of the
possible candidates (entities) to link with for each TextAnnotation. Each
candidate would be an Entity within EntityHub or any other Knowledge Base
configured in Stanbol.
Disambiguation
============
Entity Linking is not a trivial task due to the name ambiguity problem, i.e.,
the same name may refer to different entities in different contexts, and the
same entity can usually be mentioned using a set of different names. For
instance, the name Michael Jordan can refer to more than 20 entities in
Wikipedia, some of which are shown below:
- Michael Jordan (NBA Player)
- Michael I. Jordan (Berkeley Professor)
- Michael B. Jordan (American Actor)
This situation arises not only with well-known semantic knowledge bases like
DBpedia or Freebase, but also with any enterprise semantic dataset or custom
vocabulary. A simple example is resolving the ambiguity within a database of
employees.
Formally, Entity Disambiguation for Stanbol should work as follows: after
enhancing a ContentItem with an enhancement chain that includes a Linking
Engine, we get a set of TextAnnotations TA = {T1, T2, ..., Tn}. Each
TextAnnotation in TA contains a name mention, which is characterized by its
name, its local surrounding context (fise:selection-context) and the
ContentItem containing it. For each TextAnnotation Ti in TA, the Linking
Engine produces a set of EntityAnnotations EAi = {E1i, E2i, ..., ENi}. We
should rely on the linking engines to provide all possible entity annotations
(candidates within all sites in the EntityHub) for each TextAnnotation. Each
EntityAnnotation is characterized by its Knowledge Base (entityhub:site) and
its entry in that knowledge base (fise:entity-reference). The objective of the
disambiguation process is to rank each set EAi by modifying the confidence
values of its EntityAnnotations, so that the entity with the highest
confidence value is the referent entity for the TextAnnotation associated
with EAi.
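The formal description above can be sketched in a few lines of Python. This is
only an illustrative model: the class names mirror the Stanbol concepts, but
the field names, the rerank helper and the toy scorer are assumptions made for
this example and are not Stanbol APIs. The entity URIs are sample data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EntityAnnotation:
    site: str              # knowledge base, cf. entityhub:site
    entity_reference: str  # entry in that knowledge base, cf. fise:entity-reference
    confidence: float      # cf. fise:confidence


@dataclass
class TextAnnotation:
    selected_text: str      # the name mention
    selection_context: str  # cf. fise:selection-context
    candidates: List[EntityAnnotation] = field(default_factory=list)  # the set EAi


def rerank(ta: TextAnnotation,
           score: Callable[[TextAnnotation, EntityAnnotation], float]
           ) -> Optional[EntityAnnotation]:
    """Overwrite the confidence of every candidate and return the top one."""
    for ea in ta.candidates:
        ea.confidence = score(ta, ea)
    ta.candidates.sort(key=lambda ea: ea.confidence, reverse=True)
    return ta.candidates[0] if ta.candidates else None


def toy_score(ta: TextAnnotation, ea: EntityAnnotation) -> float:
    # Hypothetical scorer: favour the professor when the context mentions "learning".
    is_professor = ea.entity_reference.endswith("Michael_I._Jordan")
    return 0.9 if is_professor == ("learning" in ta.selection_context) else 0.1


ta = TextAnnotation(
    "Michael Jordan",
    "Michael Jordan published a paper on machine learning",
    [
        EntityAnnotation("dbpedia", "http://dbpedia.org/resource/Michael_Jordan", 0.8),
        EntityAnnotation("dbpedia", "http://dbpedia.org/resource/Michael_I._Jordan", 0.5),
    ],
)
best = rerank(ta, toy_score)
```

Any of the algorithms discussed below would take the place of toy_score; the
linking engine's original confidence values are simply overwritten.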
Algorithms
========
** Local Approaches
(From [1]) Conventional entity linking approaches have focused on making
independent Entity Linking decisions using the local mention-to-entity
compatibility of each isolated mention. The essential idea is to extract
discriminative features from the description of each entity and then link
each name mention in a document by comparing the mention's context with each
of its candidate referent entities. This approach is followed by the
Disambiguation-MLT engine in STANBOL-723.
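As a minimal sketch of a local approach, the example below ranks candidates by
the similarity between the mention's context and a textual description of each
candidate. Plain bag-of-words cosine similarity is used here as a stand-in
similarity measure; it is not meant to reproduce how the Disambiguation-MLT
engine computes its scores, and the descriptions are made-up sample data.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_locally(mention_context: str, descriptions: dict) -> list:
    """Return (score, entity) pairs, best first. Each mention is scored alone."""
    ctx = bag_of_words(mention_context)
    scored = [(cosine(ctx, bag_of_words(text)), entity)
              for entity, text in descriptions.items()]
    return sorted(scored, reverse=True)


descriptions = {
    "Michael Jordan (NBA Player)": "basketball player nba",
    "Michael I. Jordan (Berkeley Professor)": "professor berkeley machine learning",
}
ranking = rank_locally("Jordan teaches machine learning at Berkeley", descriptions)
```

The resulting scores could be written back as the new confidence values of the
corresponding EntityAnnotations.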
** Global Approaches (Collective Entity Linking)
The main drawback of the local approaches is that they do not take into
consideration the interdependence between different Entity Linking decisions.
Specifically, the entities in a topically coherent document are usually
semantically related to each other. In such cases, figuring out the referent
entity of one name mention may in turn give us useful information for linking
the other name mentions in the same document. This suggests that
disambiguation performance could be improved by resolving all mentions at the
same time.
This approach only makes sense in a scenario with highly connected knowledge
bases, where the entities are semantically related in some way.
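To illustrate the idea of resolving all mentions at once, the sketch below
picks the joint assignment that maximizes the sum of local scores plus the
pairwise relatedness of the chosen entities. The objective, the function name
and all scores are assumptions made for this example; the exhaustive search is
exponential in the number of mentions and is only meant for tiny inputs.

```python
from itertools import product


def resolve_collectively(candidates, local_score, relatedness):
    """candidates: {mention: [entity, ...]}; returns the jointly best assignment."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[m] for m in mentions)):
        # Local compatibility of every mention with its chosen entity ...
        score = sum(local_score[(m, e)] for m, e in zip(mentions, combo))
        # ... plus the coherence between every pair of chosen entities.
        score += sum(relatedness.get(frozenset((a, b)), 0.0)
                     for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(mentions, combo)), score
    return best


candidates = {
    "Michael Jordan": ["MJ_player", "MJ_professor"],
    "Bulls": ["Chicago_Bulls", "Bull_animal"],
}
local_score = {
    ("Michael Jordan", "MJ_player"): 0.5,
    ("Michael Jordan", "MJ_professor"): 0.5,
    ("Bulls", "Chicago_Bulls"): 0.4,
    ("Bulls", "Bull_animal"): 0.6,
}
relatedness = {frozenset(("MJ_player", "Chicago_Bulls")): 0.9}
assignment = resolve_collectively(candidates, local_score, relatedness)
```

Note that a purely local decision would link "Bulls" to Bull_animal (0.6
against 0.4); the relatedness term is what changes the outcome.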
** Graph Based Approaches
In these approaches, both the Knowledge Base and the interdependence between
possible Entity Linking decisions are modeled as graphs, and inference
algorithms are used to resolve all the mentions within a document.
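One possible inference algorithm is a random walk with restart over a graph
whose nodes are the mentions and their candidate entities. The sketch below is
an assumption about how such a graph could look (mention-to-candidate edges
plus entity-to-entity relatedness edges); node names and weights are
hypothetical, and nothing here is prescribed by Stanbol.

```python
def undirected(weighted_pairs):
    """Build {node: {neighbour: weight}} from (a, b, weight) triples."""
    graph = {}
    for a, b, w in weighted_pairs:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def random_walk_scores(graph, seeds, damping=0.85, iterations=50):
    """Random walk with restart; the walker restarts at the seed (mention) nodes."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for n, neighbours in graph.items():
            total = sum(neighbours.values())
            if total == 0:
                continue
            # Spread the current score of n over its neighbours by edge weight.
            for m, w in neighbours.items():
                nxt[m] += damping * rank[n] * w / total
        rank = nxt
    return rank


graph = undirected([
    ("m:Michael Jordan", "e:MJ_player", 0.5),
    ("m:Michael Jordan", "e:MJ_professor", 0.5),
    ("m:Bulls", "e:Chicago_Bulls", 0.5),
    ("m:Bulls", "e:Bull_animal", 0.5),
    ("e:MJ_player", "e:Chicago_Bulls", 1.0),  # knowledge-base relatedness
])
scores = random_walk_scores(graph, {"m:Michael Jordan", "m:Bulls"})
```

For each mention, the candidate with the highest score would then be ranked
first. Candidates that are connected to candidates of other mentions collect
more of the walk's probability mass than isolated ones.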
Knowledge Bases
==============
As described in STANBOL-223, disambiguation needs some data to use as
disambiguation features. The nature of this data depends on the
particularities of the knowledge base. In general, it will be necessary to
generate a semantic context for each candidate and process it in the
disambiguation algorithm. The Disambiguation Context could be a fixed data
structure for each kind of disambiguation engine in Stanbol, and developers
should be in charge of developing the mechanisms that create those contexts
for their custom vocabularies or knowledge bases.
For instance, with Local Approaches, developers should be able to configure
Disambiguation-MLT or any other local disambiguation engine to obtain a
disambiguation context from EntityHub and compute its similarity with the
mentions' contexts within the Content Item.
This can be as easy as selecting the Entity's disambiguation fields or as
complex as calling methods that build disambiguation contexts on the fly.
Normally, the first option will involve generating the disambiguation fields
at EntityHub index creation time. For instance, as described in STANBOL-223,
for DBpedia it is possible to extract sentences containing mentions of
entities from Wikipedia using https://github.com/ogrisel/pignlproc. These
sentences can be included in the DBpedia EntityHub index as disambiguation
fields. Entities' abstracts can also be used for disambiguation. All these
fields should be configurable (boost) for disambiguation purposes.
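A disambiguation context built from configurable, boosted fields could look
like the sketch below. The field names (label, abstract, mention_sentences),
the boost values and the weighted bag-of-words representation are all
hypothetical choices for this example, not an existing EntityHub
configuration.

```python
import re
from collections import Counter


def build_context(entity_fields: dict, boosts: dict) -> Counter:
    """Weighted bag of words over the configured disambiguation fields."""
    context = Counter()
    for field_name, boost in boosts.items():
        for value in entity_fields.get(field_name, []):
            for token in re.findall(r"[a-z0-9]+", value.lower()):
                context[token] += boost
    return context


# Sample entity as it might be stored in an index (illustrative values only).
entity = {
    "label": ["Michael I. Jordan"],
    "abstract": ["Professor at Berkeley"],
    "mention_sentences": ["Jordan works on machine learning"],
    "homepage": ["http://example.org/"],  # not configured, hence ignored
}
boosts = {"label": 2.0, "abstract": 1.0, "mention_sentences": 0.5}
context = build_context(entity, boosts)
```

Fields that are not listed in the boost configuration do not contribute to the
context, so the same index can serve engines with different settings.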
General Architecture and Workflow
==========================
A typical Disambiguation system architecture would include three steps:
** Candidate Generation: from a surface form (name mention) in the Content
Item, generate a set of possible entities within EntityHub to link with. A
typical source of entity names is the entities' labels, but other fields can
be used. In this step, it is necessary to decide how to search those name
sources: Exact Matching, Overlapping, Fuzzy Search, Full-Text Search,
Case-sensitive, Coreference Resolution...
** Candidate Ranking: rank all candidates by their probability of being the
referent entity. Basically, this step involves the execution of the specific
disambiguation engine as an enhancement post-processing phase.
** Detect and Cluster Missing Entities: those mentions that actually shouldn't
be linked to any Entity should be extracted and grouped in clusters (one
cluster for each unknown entity). These entities can be suggested to the user
so that they can be included in the knowledge base (Automatic Knowledge Base
Population).
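The first and third steps can be sketched as follows; the ranking step would
plug in one of the disambiguation algorithms discussed earlier. The label
index, the fuzzy cutoff and the rule "a mention without any candidate is a
missing entity" are simplifying assumptions for this example (a real system
might instead use a confidence threshold), and the entity identifiers are
hypothetical.

```python
import difflib
from collections import defaultdict


def generate_candidates(mention, label_index, fuzzy_cutoff=0.8):
    """label_index: {label: [entity id, ...]}. Exact match first, then fuzzy."""
    if mention in label_index:
        return list(label_index[mention])
    # Case-insensitive fuzzy search; labels differing only in case share a key.
    lowered = {label.lower(): label for label in label_index}
    close = difflib.get_close_matches(mention.lower(), list(lowered),
                                      n=5, cutoff=fuzzy_cutoff)
    return [e for key in close for e in label_index[lowered[key]]]


def cluster_unlinked(mentions, label_index):
    """Group mentions without candidates; one cluster per normalised form."""
    clusters = defaultdict(list)
    for mention in mentions:
        if not generate_candidates(mention, label_index):
            clusters[" ".join(mention.lower().split())].append(mention)
    return dict(clusters)


label_index = {
    "Michael Jordan": ["MJ_player", "MJ_professor"],
    "Chicago Bulls": ["Chicago_Bulls"],
}
exact = generate_candidates("Michael Jordan", label_index)
fuzzy = generate_candidates("michael jordon", label_index)
missing = cluster_unlinked(["Acme Widgets", "acme  widgets", "Michael Jordan"],
                           label_index)
```

Each cluster returned by cluster_unlinked stands for one unknown entity that
could be suggested to the user for inclusion in the knowledge base.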
> Entity Disambiguation for Stanbol
> ---------------------------------
>
> Key: STANBOL-1037
> URL: https://issues.apache.org/jira/browse/STANBOL-1037
> Project: Stanbol
> Issue Type: Story
> Components: Enhancer, Entityhub
> Reporter: Rafa Haro
> Labels: gsoc2013, mentoring
> Attachments: stanbol-enhancement-workflow.001.png