Hi all,

To give this a start, I created STANBOL-1183 [1] and added a first
suggestion for a disambiguation API.

* the `Entity Disambiguation Context` is tailored towards the "Session
(local) disambiguation" usage scenario.
* the `DisambiguationData` resembles the class of the same name in the
Disambiguation MLT engine that was already reused by this year's GSoC
projects.
* the `DisambiguationContext` tries to abstract the building of
contexts from the algorithm used for disambiguation. While some
engines will come with both a context and an algorithm, the hope is
that others might be able to reuse existing context or algorithm
implementations (see the rough interface sketch below).
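
To make the proposal a bit more concrete, here is a minimal sketch of how
I currently imagine the interfaces fitting together. All names and
signatures are placeholders for discussion only and are not (yet) part of
STANBOL-1183 in exactly this form:

import java.util.Collection;
import java.util.List;

/** Read-only view on a single entity as seen during disambiguation. */
public interface EntityContext {

    String getEntityId();

    /**
     * Values of a context feature of the entity (e.g. labels, types,
     * related entities), keyed by the property/feature name.
     */
    Collection<Object> getFeature(String featureName);
}

/** Resolves EntityContext instances, e.g. backed by an Entityhub Site. */
public interface EntityContextProvider {

    EntityContext getContext(String entityId);
}

/**
 * Holds the suggestions parsed from the enhancement metadata, roughly
 * what the DisambiguationData class of the MLT engine does today.
 */
public interface DisambiguationData {

    /** Suggested entity IDs for a given fise:TextAnnotation. */
    List<String> getSuggestions(String textAnnotationUri);
}

/**
 * Ties a context to an algorithm, so that engines can reuse either part
 * independently.
 */
public interface DisambiguationContext {

    EntityContextProvider getEntityContextProvider();

    DisambiguationData getDisambiguationData();
}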

It would be great to get some feedback on this proposal!

Next Steps: Assuming some positive feedback, I would like to start with
the Entity Disambiguation Context part of the API. Most likely I will
start with an Entityhub Representation based implementation of
`EntityContext` and an Entityhub Site based implementation of the
`EntityContextProvider` (see the sketch below). I will also create
SolrYard indexes for datasets such as geonames.org and DBpedia for
testing.
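
Roughly along these lines (again only a sketch: the Entityhub calls such
as Site#getEntity, Entity#getRepresentation and Representation#get, and
the package names, are written from memory and may need adjusting against
the actual Entityhub API; EntityContext and EntityContextProvider are the
placeholder interfaces sketched above):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;

import org.apache.stanbol.entityhub.servicesapi.model.Entity;
import org.apache.stanbol.entityhub.servicesapi.model.Representation;
import org.apache.stanbol.entityhub.servicesapi.site.Site;
import org.apache.stanbol.entityhub.servicesapi.site.SiteException;

/** EntityContext backed by an Entityhub Representation. */
public class RepresentationEntityContext implements EntityContext {

    private final Representation rep;

    public RepresentationEntityContext(Representation rep) {
        this.rep = rep;
    }

    @Override
    public String getEntityId() {
        return rep.getId();
    }

    @Override
    public Collection<Object> getFeature(String featureName) {
        // featureName is expected to be a property URI (e.g. rdfs:label)
        Collection<Object> values = new ArrayList<Object>();
        Iterator<Object> it = rep.get(featureName);
        while (it != null && it.hasNext()) {
            values.add(it.next());
        }
        return values;
    }
}

/** EntityContextProvider backed by an Entityhub Site (e.g. a SolrYard index). */
public class SiteEntityContextProvider implements EntityContextProvider {

    private final Site site;

    public SiteEntityContextProvider(Site site) {
        this.site = site;
    }

    @Override
    public EntityContext getContext(String entityId) {
        try {
            Entity entity = site.getEntity(entityId);
            return entity == null ? null
                    : new RepresentationEntityContext(entity.getRepresentation());
        } catch (SiteException e) { // exception type assumed
            throw new IllegalStateException(
                "Unable to access Entityhub Site " + site.getId(), e);
        }
    }
}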

best
Rupert


[1] https://issues.apache.org/jira/browse/STANBOL-1183

On Fri, Oct 4, 2013 at 8:57 AM, Antonio David Perez Morales
<ape...@zaizi.com> wrote:
> Hi all
>
> Thanks for the support of the community (especially Rupert and Rafa) during
> the project.
>
> I agree with all the conclusions from the discussion on the Stanbol IRC
> channel, so we can define a definitive roadmap (for the time being) in order
> to start developing these topics.
>
> Regards
>
>
> On Thu, Oct 3, 2013 at 7:16 PM, Dileepa Jayakody
> <dileepajayak...@gmail.com> wrote:
>
>>
>>
>>
>> On Thu, Oct 3, 2013 at 10:21 PM, Rafa Haro <rh...@zaizi.com> wrote:
>>
>>> Hi fellas,
>>>
>>> With http://svn.apache.org/r1528907 the GSoC projects' source code has
>>> been committed to a new branch that we have called "disambiguation". As you
>>> might know, this year there were two proposals for Stanbol, both related
>>> to disambiguation engines. Dileepa Jayakody has developed an Entity
>>> Disambiguation Engine using FOAF Correlation (STANBOL-1161) and Antonio
>>> Perez a Graph-Based Freebase Disambiguation Engine (STANBOL-1156). AFAIK,
>>> the results of both projects will be published by Google next week, but
>>> according to the mentors they have successfully accomplished them. I would
>>> like to congratulate both Antonio and Dileepa again on the good work.
>>
>>
>> Thanks to all for the support and guidance given throughout the project; it
>> was a great experience working with the Stanbol community.
>>
>>
>>> Please feel free to test both solutions. In order to do it properly, you
>>> need to go through the README documents, because both projects use some
>>> external resources that need to be built.
>>>
>>> Because both projects have several features in common, we have been
>>> discussing on the Stanbol IRC channel a roadmap to refactor both projects
>>> and continue improving the disambiguation support in Stanbol. The proposed
>>> actions are summarized as follows:
>>>
>>> 1. Create an API that would allow easily extracting disambiguation
>>> features from the context (ContentItem + Annotations). This might include
>>> a better API for dealing with Annotations and the results of previous
>>> engines.
>>>
>>
>> +1. EntityAnnotation and TextAnnotation like abstractions are used for
>> various purposes in disambiguation, so creating Java classes and an API for
>> them will be extremely useful.
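
Purely as an illustration of the kind of annotation abstractions meant
here, a hypothetical sketch (none of these types exist yet; all names are
placeholders, only the fise:* properties they would wrap are part of the
existing enhancement structure):

/** Hypothetical read-only view on a fise:TextAnnotation (a mention). */
public interface TextAnnotationView {
    String getSelectedText();     // fise:selected-text
    String getSelectionContext(); // fise:selection-context
}

/** Hypothetical read-only view on a fise:EntityAnnotation (a suggestion). */
public interface EntityAnnotationView {
    String getEntityReference();  // fise:entity-reference
    double getConfidence();       // fise:confidence
    TextAnnotationView getTextAnnotation(); // the mention it refers to
}
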
>>
>>>
>>> 2. Provide a framework for Session (local) disambiguation. The framework
>>> should allow configuring disambiguation features from custom sites and
>>> plugging in algorithms that use those features.
>>>
>> Can you please give some more details on this point?
>> I guess it is a framework to plug in custom vocabularies and configure
>> disambiguation based on those vocabularies? Please correct me if I have got
>> the idea wrong.
>>
>>
>>> 3. Provide a framework for Knowledge Based Disambiguation Algorithms. We
>>> have identified three types: text based (e.g. Solr MLT), graph based and
>>> machine learning based. ML based approaches are more complex to
>>> generalize, so we would discard them for now. For both text and graph
>>> based approaches, we would need to create a framework that eases the
>>> storage and management of KBs. Typically, text based approaches need to
>>> store textual contents and evidence for the entities. For example,
>>> Wikilinks is a dataset of documents with mentions of Freebase entities
>>> that can be used as disambiguation evidence. Graph based approaches need
>>> graph databases in order to store the relationships between the entities
>>> and provide efficient ways to manipulate the graph and plug in graph
>>> based algorithms.
>>>
>> +1.
>>
>>> Looking forward to your feedback.
>>>
>>> Cheers,
>>>
>>> Rafa Haro
>>>
>> Thanks,
>> Dileepa
>>
>>
>>
>>
>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
