Re: [GSOC Idea] Disambiguation algorithm for Apache Stanbol

Rafa Haro Tue, 30 Apr 2013 09:04:03 -0700

Hi Rupert, Antonio, all

On 27/04/13 16:35, Rupert Westenthaler wrote:

For this, I would like to discuss some topics about the proposal:
>
>- Knowledge Base: I have decided to stick first to Freebase, because it has
>a REST API allowing 100k calls per day for read and 10k for write. Besides
>the REST API, an alternative could be to integrate the whole Freebase graph
>in Stanbol and use their Java API to manage it. Ideally, the management
>framework should be valid for other knowledge bases such as Wikipedia or
>DBpedia.
>

I recently created my first Freebase index for Stanbol (see
STANBOL-1014 for the Indexing tool). First tests on an index with all
Freebase Topics and all languages have shown very nice results! IMO
Freebase is currently for sure the better choice over DBpedia. However
one needs to see/wait how Freebase compares to the Wikidata project
[4] that only recently entered phase 2.


Designing disambiguation in a way that it can be applied to other
datasets would be for sure a great bonus. But given the good results
one can get with Freebase I would even be very interested if the
results would only work on Freebase ^^

Following Rupert's idea, I agree that maybe the best is to develop a Knowledge Base manager within Stanbol for disambiguation purposes. IMO, it would be a mistake to try to come up with a universal solution. I suppose that one wants to generate one's knowledge base differently according to custom data domains. For instance, a graph representation is more suitable for "real world" knowledge bases, while most domains are well covered with a taxonomy structure.

It would be important to develop tools to allow Stanbol to interact with these knowledge bases from and to EntityHub sites. Of course, a good way to learn how to do that could be developing first a nice solution only for Freebase.
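
To make the graph vs. taxonomy point a bit more concrete, here is a minimal sketch of what such a manager could hide behind one interface. It is Python only for brevity, and every name in it (KnowledgeBase, GraphKnowledgeBase, TaxonomyKnowledgeBase, related, coherence) is hypothetical, not existing Stanbol or EntityHub API. The scoring function is just a naive overlap count to show that the disambiguation code does not need to know which structure is behind it.

```python
from abc import ABC, abstractmethod


class KnowledgeBase(ABC):
    """Hypothetical interface a disambiguation engine could program against."""

    @abstractmethod
    def related(self, entity_id):
        """Return the set of entity ids considered related to entity_id."""


class GraphKnowledgeBase(KnowledgeBase):
    """'Real world' knowledge base: entities linked by undirected edges."""

    def __init__(self, edges):
        self._neighbours = {}
        for a, b in edges:
            self._neighbours.setdefault(a, set()).add(b)
            self._neighbours.setdefault(b, set()).add(a)

    def related(self, entity_id):
        # Related entities are the direct neighbours in the graph.
        return set(self._neighbours.get(entity_id, set()))


class TaxonomyKnowledgeBase(KnowledgeBase):
    """Domain knowledge base: each entity has at most one broader concept."""

    def __init__(self, parent_of):
        self._parent_of = dict(parent_of)

    def related(self, entity_id):
        # Related entities are all ancestors up to the root.
        # The membership check guards against accidental cycles.
        result = set()
        current = self._parent_of.get(entity_id)
        while current is not None and current not in result:
            result.add(current)
            current = self._parent_of.get(current)
        return result


def coherence(kb, candidate, context_entities):
    """Naive score: how many context entities the candidate is related to."""
    return len(kb.related(candidate) & set(context_entities))
```

With something like this, a Freebase-only implementation could come first and other backends could be added later without touching the scoring code.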

>- Resources: As has been pointed out before in the mailing lists, Google has
>released a couple of resources to be used in disambiguation applications.
>One is a dictionary of concepts from Wikipedia, using anchor text labels in
>Wikipedia internal links to create an index of possible entity names [2].
>The second one is a dataset of texts that link to concepts in
>Wikipedia [3] that can be used as disambiguation contexts according to
>STANBOL-1037. I need to research if similar information can be retrieved
>directly from Freebase or, in other words, to check if this information is
>already incorporated in Freebase.
>

I think you can even use [2] and [3] for disambiguation on top of
Freebase as there is anyway a mapping between Freebase and DBpedia
concepts. However you will likely need a higher quality mapping than
is currently available. Because of that I would suggest you start
off with implementing STANBOL-1046 [5]. For possible names (or surface
forms as they are also often called) one can use the Alias in
Freebase. However AFAIK there is no information available in Freebase
similar to [3]. Related to this I found however an interesting paper
[6]. The semi-supervised approach suggested in chapter III could
nicely work. Especially if one considers that users could manually
disambiguate Entities. In combination with other mentions extracted by
the Stanbol Enhancer this could be used to acquire the required data.

I also suggest comparing both resources and trying to improve them. For example, AFAIK, DBpedia Spotlight also uses Wikipedia's Disambiguation and Redirect pages to collect more surface forms. My impression is that we can improve the Google Concept Dictionary by bringing together entities' names data from Freebase and Wikipedia. Google's dictionary seems to contain only labels used in anchor texts from internal links.
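
As a rough illustration of what "bringing together" could mean, here is a minimal sketch that merges several surface form dictionaries into one and ranks the candidate entities for a mention. It is only a sketch under assumptions: the function names (merge_surface_forms, candidates) are made up, each source is assumed to be a plain {surface form -> {entity id -> count}} mapping, and the entity ids are assumed to be already mapped to one common scheme. The data in the usage note is invented for the example and is not taken from [2] or from Freebase.

```python
from collections import defaultdict


def _normalise(text):
    # Lower-case and collapse whitespace so that "Big  Apple" == "big apple".
    return " ".join(text.lower().split())


def merge_surface_forms(*sources):
    """Merge several {surface form -> {entity id -> count}} dictionaries."""
    merged = defaultdict(lambda: defaultdict(int))
    for source in sources:
        for form, entities in source.items():
            key = _normalise(form)
            for entity_id, count in entities.items():
                merged[key][entity_id] += count
    return {form: dict(entities) for form, entities in merged.items()}


def candidates(dictionary, mention):
    """Return (entity id, relative frequency) pairs, most frequent first."""
    entities = dictionary.get(_normalise(mention), {})
    total = sum(entities.values())
    if total == 0:
        return []
    return sorted(
        ((entity_id, count / total) for entity_id, count in entities.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
```

For example, merging an anchor text source {"Big Apple": {"New_York_City": 9}} with an alias source {"big apple": {"New_York_City": 1, "Big_Apple_(film)": 1}} gives a single "big apple" entry where the alias source contributes a candidate that the anchor texts alone would miss.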

>Moreover, the proposal design will try to be as generic as possible in
>order to be adaptable to any other Knowledge Base.
>

Disambiguation is not something easy and making something "generic"
makes it even harder. So IMO having one/several more specific options
would not hurt a GSoC proposal. It would also make it easier to
evaluate the proposal.

Go for it!!

Cheers

Rafa



