Re: [GSOC Idea] Disambiguation algorithm for Apache Stanbol

Rupert Westenthaler Sat, 27 Apr 2013 07:36:02 -0700

Hi Antonio

First of all thx for your interest!

On Thu, Apr 25, 2013 at 4:04 PM, Antonio Perez <[email protected]> wrote:
> Hi everybody
>
> I'm Antonio David Pérez, a new Zaizi team member and an MSc student at
> the University of Seville. Lately, I've been involved in the development of
> a semantic CMS solution at a Spanish company called Ximdex, working with
> several technologies like Apache Nutch, Apache Solr and also Apache Stanbol.
>
> Currently, I've been assigned to a project that involves different
> technologies like Apache Stanbol and Apache ManifoldCF. So, related to
> Stanbol, I'm interested in the disambiguation problem, so I would like to
> prepare a proposal for GSoC about this topic.
>

If you already have some experience with Apache Stanbol, this
would for sure be a big help for a GSoC project.

> I have been following the latest mails about disambiguation and the WebID
> protocol. I would be more interested in developing disambiguation systems
> within Stanbol using the major semantic knowledge bases. Actually, my initial
> idea is to use Freebase, with the aim of making it extensible to any other
> database like Wikipedia and DBpedia. Following STANBOL-1037 [1], the main
> goal is to implement a couple of global-approach disambiguation algorithms
> to be used in Stanbol.
>

Disambiguation on "World Domain" datasets is a very important feature
for a lot of usage scenarios. So this is definitely very interesting and
relevant for Apache Stanbol.

> For this, I would like to discuss some topics about the proposal:
>
> - Knowledge Base: I have decided to stick first to Freebase, because it has
> a REST API allowing 100k calls per day for read and 10k for write. Besides
> the REST API, an alternative could be to integrate the whole Freebase graph
> in Stanbol and use their Java API to manage it. Ideally, the management
> framework should be valid for other knowledge bases such as Wikipedia or
> DBpedia.
>

I recently created my first Freebase index for Stanbol (see
STANBOL-1014 for the indexing tool). First tests on an index with all
Freebase Topics and all languages have shown very nice results! IMO
Freebase is currently for sure the better choice over DBpedia. However,
one needs to wait and see how Freebase compares to the Wikidata project
[4], which only recently entered phase 2.

Designing disambiguation in such a way that it can be applied to other
datasets would for sure be a great bonus. But given the good results
one can get with Freebase, I would be very interested even if the
results only worked on Freebase ^^

> - Resources: As has been pointed out before on the mailing lists, Google has
> released a couple of resources to be used in disambiguation applications.
> One is a dictionary of concepts from Wikipedia, using anchor text labels in
> Wikipedia internal links to create an index of possible entity names [2].
> The second one is a dataset of texts that link to concepts in
> Wikipedia [3] that can be used as disambiguation contexts according to
> STANBOL-1037. I need to research whether similar information can be retrieved
> directly from Freebase or, in other words, to check whether this information
> is already incorporated in Freebase.
>

I think you can even use [2] and [3] for disambiguation on top of
Freebase, as there is a mapping between Freebase and DBpedia
concepts anyway. However, you will likely need a higher quality mapping
than is currently available. Because of that I would suggest you start
off by implementing STANBOL-1046 [5]. For possible names (or surface
forms, as they are also often called) one can use the aliases in
Freebase. However, AFAIK there is no information available in Freebase
similar to [3]. Related to this, however, I found an interesting paper
[6]. The semi-supervised approach suggested in chapter III could
work nicely, especially if one considers that users could manually
disambiguate Entities. In combination with other mentions extracted by
the Stanbol Enhancer, this could be used to acquire the required data.
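
To make the surface form idea a bit more concrete, here is a minimal
sketch in Python. It is not Stanbol code and not the format used by [2];
it only assumes that one has (surface form, entity) pairs, e.g. from
aliases or from anchor texts, and derives a simple prior for each
candidate from how often a surface form refers to it. The function names
and the "ex:" entity ids are made up for illustration.

```python
from collections import Counter, defaultdict


def build_surface_form_index(pairs):
    """Count how often each (lower-cased) surface form refers to each entity."""
    index = defaultdict(Counter)
    for surface_form, entity_id in pairs:
        index[surface_form.strip().lower()][entity_id] += 1
    return index


def candidates(index, mention):
    """Return (entity_id, prior) pairs for a mention, most frequent first."""
    counts = index.get(mention.strip().lower())
    if not counts:
        return []
    total = sum(counts.values())
    return [(entity, n / total) for entity, n in counts.most_common()]


# Hypothetical toy data, not taken from Freebase or from [2].
pairs = [
    ("Paris", "ex:Paris_France"),
    ("Paris", "ex:Paris_France"),
    ("Paris", "ex:Paris_France"),
    ("Paris", "ex:Paris_Texas"),
    ("City of Light", "ex:Paris_France"),
]
index = build_surface_form_index(pairs)
print(candidates(index, "paris"))
```

Such a prior only ranks candidates by how common they are for a given
name. The actual disambiguation would still have to combine it with
context, like the other mentions found in the same text.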

> Moreover, the proposal design will try to be as generic as possible in
> order to be adaptable to any other Knowledge Base.
>

Disambiguation is not easy, and making it "generic"
makes it even harder. So IMO having one or several more specific options
would not hurt a GSoC proposal. It would also make it easier to
evaluate the proposal.

> Waiting for your comments and valuable suggestions.
>

Hope my comments provided at least some valuable information.

best
Rupert

References:

> [1] https://issues.apache.org/jira/browse/STANBOL-1037
> [2] http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html
> [3] https://code.google.com/p/wiki-links/
[4] https://www.wikidata.org/wiki/Wikidata:Main_Page
[5] https://issues.apache.org/jira/browse/STANBOL-1046
[6] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/38389.pdf

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
