[ https://issues.apache.org/jira/browse/STANBOL-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rafa Haro reassigned STANBOL-1037:
----------------------------------

    Assignee: Rafa Haro

> Entity Disambiguation for Stanbol
> ---------------------------------
>
>                 Key: STANBOL-1037
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1037
>             Project: Stanbol
>          Issue Type: Story
>          Components: Enhancer, Entityhub
>            Reporter: Rafa Haro
>            Assignee: Rafa Haro
>            Priority: Major
>              Labels: gsoc2013, mentoring
>         Attachments: stanbol-enhancement-workflow.001.png
>
>
> Entity Disambiguation in Stanbol mainly refers to the process of modifying the
> fise:confidence values of the EntityAnnotations obtained as a result of any
> Linking Engine within Stanbol (EntityLinkingEngine or NamedEntityLinking).
> These confidence values should be modified so that, after a disambiguation
> process, a ranking of the possible candidate entities to link with is obtained
> for each EntityAnnotation. Each candidate would be an Entity within the
> EntityHub or in any other Knowledge Base configured in Stanbol.
>
> Disambiguation
> ============
>
> Entity Linking is not a trivial task due to the name ambiguity problem: the
> same name may refer to different entities in different contexts, and the same
> entity can usually be mentioned using a set of different names. For instance,
> the name Michael Jordan can refer to more than 20 entities in Wikipedia, some
> of which are shown below:
> - Michael Jordan (NBA Player)
> - Michael I. Jordan (Berkeley Professor)
> - Michael B. Jordan (American Actor)
> This situation arises not only in well-known semantic knowledge bases such as
> DBpedia or Freebase, but is also important for any enterprise semantic dataset
> or custom vocabulary. An immediate example is resolving the ambiguity within a
> database of employees.
>
> Formally, Entity Disambiguation for Stanbol should work as follows: after an
> enhancement process of a ContentItem using an enhancement chain that includes
> a Linking Engine, we get a set of TextAnnotations TA = {T1, T2, ..., Tn}. Each
> TextAnnotation in TA should contain a name mention, which is characterized by
> its name, its local surrounding context (fise:selection-context) and the
> ContentItem containing it. For each TextAnnotation Ti in TA, and as a result
> of the Linking Engine, we get a set of EntityAnnotations EAi = {E1i, E2i, ...,
> ENi}. We should rely on the linking engines to provide all possible entity
> annotations (candidates within all sites in the EntityHub) for each
> TextAnnotation. Each EntityAnnotation is characterized by its Knowledge Base
> (entityhub:site) and its entry in that knowledge base (fise:entity-reference).
> The objective of the disambiguation process is to rank each EntityAnnotation
> set EAi by modifying the confidence values of its EntityAnnotations, so that
> the entity with the highest confidence value is the referent entity for the
> TextAnnotation associated with EAi.
>
> Algorithms
> ========
>
> ** Local Approaches
> (From [1]) Conventional entity linking approaches have focused on making
> independent Entity Linking decisions using the local mention-to-entity
> compatibility of each isolated mention. The essential idea is to extract
> discriminative features from the description of a specific entity and then
> link each name mention in a document by comparing the contextual similarity
> between the mention and each of its candidate referent entities. This
> approach is followed by the Disambiguation-MLT engine in STANBOL-723.
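>
> As a minimal illustration of such a local re-ranking, the following sketch in
> plain Java recomputes fise:confidence values from a simple bag-of-words cosine
> similarity between a mention's fise:selection-context and each candidate's
> textual description. The TextAnnotation and EntityAnnotation classes below are
> simplified placeholders (assumptions for illustration), not the actual Stanbol
> enhancement structure or EntityHub API, and the similarity measure is just one
> possible choice.
>
> {code:java}
> import java.util.*;
>
> /**
>  * Sketch of a "local" disambiguation step: re-score each candidate entity of a
>  * mention by the cosine similarity between the mention's selection context and
>  * a textual description of the candidate (e.g. an abstract or a disambiguation
>  * field). TextAnnotation/EntityAnnotation are placeholders, not Stanbol API.
>  */
> public class LocalDisambiguationSketch {
>
>     static class EntityAnnotation {        // placeholder for a fise:EntityAnnotation
>         String entityReference;            // fise:entity-reference
>         String description;                // textual description of the candidate
>         double confidence;                 // fise:confidence, rewritten below
>     }
>
>     static class TextAnnotation {          // placeholder for a fise:TextAnnotation
>         String selectionContext;           // fise:selection-context
>         List<EntityAnnotation> candidates = new ArrayList<>();
>     }
>
>     /** Re-rank the candidates of every mention independently (local decision). */
>     static void disambiguate(List<TextAnnotation> mentions) {
>         for (TextAnnotation ta : mentions) {
>             Map<String, Integer> context = termFrequencies(ta.selectionContext);
>             for (EntityAnnotation ea : ta.candidates) {
>                 ea.confidence = cosine(context, termFrequencies(ea.description));
>             }
>             // highest confidence first = suggested referent entity
>             ta.candidates.sort(
>                 Comparator.comparingDouble((EntityAnnotation e) -> e.confidence).reversed());
>         }
>     }
>
>     static Map<String, Integer> termFrequencies(String text) {
>         Map<String, Integer> tf = new HashMap<>();
>         for (String token : text.toLowerCase().split("\\W+")) {
>             if (!token.isEmpty()) {
>                 tf.merge(token, 1, Integer::sum);
>             }
>         }
>         return tf;
>     }
>
>     static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
>         double dot = 0, normA = 0, normB = 0;
>         for (Map.Entry<String, Integer> e : a.entrySet()) {
>             dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
>             normA += e.getValue() * e.getValue();
>         }
>         for (int v : b.values()) {
>             normB += (double) v * v;
>         }
>         return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
>     }
> }
> {code}
>
> In a real deployment this logic would presumably run as a post-processing
> enhancement step over the ContentItem metadata, but the re-ranking itself
> would stay the same.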
>
> ** Global Approaches (Collective Entity Linking)
> The main drawback of the local approaches stems from the fact that they do
> not take into consideration the interdependence between different Entity
> Linking decisions. Specifically, the entities in a topically coherent document
> are usually semantically related to each other. In such cases, figuring out
> the referent entity of one name mention may in turn give us useful information
> to link the other name mentions in the same document. This suggests that
> disambiguation performance could be improved by resolving all mentions at the
> same time. This approach only makes sense in a scenario with highly connected
> knowledge bases where the entities are semantically related in some way.
>
> ** Graph-Based Approaches
> In these approaches, both the Knowledge Base and the interdependence between
> possible Entity Linking decisions are modeled as graphs, and inference
> algorithms are used to resolve all the mentions within a document.
>
> Knowledge Bases
> ==============
>
> As described in STANBOL-223, disambiguation requires some data to be used as
> disambiguation features. The nature of this data will depend on the
> particularities of the knowledge base. In general, it will be necessary to
> generate a semantic context for each candidate and process it in the
> disambiguation algorithm. The Disambiguation Context could be a fixed data
> structure for each kind of disambiguation engine in Stanbol, and developers
> should be in charge of developing the mechanisms that create those contexts
> for their custom vocabularies or knowledge bases.
> For instance, with local approaches, developers should be able to configure
> Disambiguation-MLT or any other local disambiguation engine so that it obtains
> a disambiguation context from the EntityHub and computes its similarity with
> the mentions' contexts within the ContentItem. This can be as simple as
> selecting the entity's disambiguation fields or as complex as calling methods
> that build disambiguation contexts on the fly. Normally, the first option will
> involve generating disambiguation fields at EntityHub index creation time. For
> instance, as described in STANBOL-223, for DBpedia it is possible to extract
> sentences containing occurrences of entities' mentions from Wikipedia using
> https://github.com/ogrisel/pignlproc. These sentences can be included in the
> DBpedia EntityHub index as disambiguation fields. Entities' abstracts can also
> be used for disambiguation. All these fields should be configurable (boosted)
> for disambiguation purposes.
>
> General Architecture and Workflow
> ==========================
>
> A typical disambiguation system architecture would include three steps:
>
> ** Candidates Generation: from a surface form (name mention) in the
> ContentItem, generate a set of possible entities within the EntityHub to link
> with. A typical source of entities' names are their labels, but other fields
> can be used as well. In this step it is necessary to decide how to search on
> those name sources: exact matching, overlapping, fuzzy search, full-text
> search, case sensitivity, coreference resolution, etc. (a sketch of this step
> follows the list).
>
> ** Candidate Ranking: rank all candidates by their probability of being the
> referent entity. Basically, this step involves the execution of the specific
> disambiguation engine as an enhancement post-processing phase.
>
> ** Detect and Cluster Missing Entities: those mentions that actually should
> not be linked to any Entity should be extracted and grouped into clusters (one
> cluster for each unknown entity). These entities can then be suggested to the
> user for inclusion in the knowledge base (Automatic Knowledge Base
> Population). A sketch of this step also follows the list.
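>
> To illustrate the Candidates Generation step, below is a minimal sketch in
> plain Java. The LabelIndex interface and the Candidate class are placeholders
> (assumptions for illustration), not the actual EntityHub query API, and a real
> implementation would query an index rather than scan an in-memory map. The
> sketch only shows how a surface form could be matched against entity labels
> under different matching strategies.
>
> {code:java}
> import java.util.*;
> import java.util.stream.Collectors;
>
> /**
>  * Sketch of the "Candidates Generation" step: match a surface form against
>  * entity labels under different strategies. LabelIndex and Candidate are
>  * placeholders; a real implementation would query an EntityHub site index.
>  */
> public class CandidateGenerationSketch {
>
>     enum MatchMode { EXACT, CASE_INSENSITIVE, CONTAINS }
>
>     static class Candidate {
>         final String entityUri;     // would become fise:entity-reference
>         final String matchedLabel;  // the label that matched the surface form
>         Candidate(String uri, String label) { entityUri = uri; matchedLabel = label; }
>     }
>
>     /** Placeholder for a label lookup over a knowledge base. */
>     interface LabelIndex {
>         Map<String, String> labelToEntity();   // label -> entity URI
>     }
>
>     static List<Candidate> generateCandidates(String surfaceForm, LabelIndex index,
>                                               MatchMode mode) {
>         return index.labelToEntity().entrySet().stream()
>                 .filter(e -> matches(surfaceForm, e.getKey(), mode))
>                 .map(e -> new Candidate(e.getValue(), e.getKey()))
>                 .collect(Collectors.toList());
>     }
>
>     static boolean matches(String surfaceForm, String label, MatchMode mode) {
>         switch (mode) {
>             case EXACT:            return label.equals(surfaceForm);
>             case CASE_INSENSITIVE: return label.equalsIgnoreCase(surfaceForm);
>             case CONTAINS:         return label.toLowerCase()
>                                                .contains(surfaceForm.toLowerCase());
>             default:               return false;
>         }
>     }
> }
> {code}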
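>
> And for the Detect and Cluster Missing Entities step, a minimal sketch of one
> possible (deliberately naive) strategy: mentions whose best candidate stays
> below a confidence threshold are treated as unknown entities and grouped by
> their normalized surface form, so that each cluster can later be suggested as
> a new entry for the knowledge base. The Mention class, the threshold value and
> the example data are illustrative assumptions, not Stanbol API.
>
> {code:java}
> import java.util.*;
>
> /**
>  * Sketch of the "Detect and Cluster Missing Entities" step: group mentions
>  * that could not be confidently linked, one cluster per (assumed) unknown
>  * entity. Mention is a placeholder; the cluster key is deliberately naive.
>  */
> public class NilClusteringSketch {
>
>     static class Mention {
>         final String surfaceForm;     // fise:selected-text
>         final double bestConfidence;  // confidence of the top-ranked candidate, 0 if none
>         Mention(String s, double c) { surfaceForm = s; bestConfidence = c; }
>         @Override public String toString() { return surfaceForm; }
>     }
>
>     /** Cluster all mentions whose best candidate is below the linking threshold. */
>     static Map<String, List<Mention>> clusterUnlinked(List<Mention> mentions,
>                                                       double threshold) {
>         Map<String, List<Mention>> clusters = new LinkedHashMap<>();
>         for (Mention m : mentions) {
>             if (m.bestConfidence < threshold) {                  // not confidently linked
>                 String key = m.surfaceForm.trim().toLowerCase(); // naive cluster key
>                 clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(m);
>             }
>         }
>         return clusters;   // candidates for automatic knowledge base population
>     }
>
>     public static void main(String[] args) {
>         List<Mention> mentions = Arrays.asList(
>                 new Mention("Michael Jordan", 0.92),  // confidently linked, ignored
>                 new Mention("ACME FooBar", 0.10),     // unknown entity
>                 new Mention("Acme Foobar", 0.05));    // same unknown entity, different casing
>         System.out.println(clusterUnlinked(mentions, 0.5));
>         // -> {acme foobar=[ACME FooBar, Acme Foobar]}
>     }
> }
> {code}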