Re: [GSOC Idea] Disambiguation algorithm for Apache Stanbol

Rupert Westenthaler Sat, 27 Apr 2013 22:28:36 -0700

Hi harish


On Sat, Apr 27, 2013 at 8:00 PM, harish suvarna <[email protected]> wrote:
> Rupert,
> Is the freebase index available for use? From where I can get it?
> I can compare the entities I get from dbpedia and freebase on some of my
> test files.

The index is installed on

    http://dev.iks-project.eu:8083/entityhub/site/freebase/

you can also use it with the enhancer by using

    http://dev.iks-project.eu:8083/enhancer/chain/freebase-proper-noun
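
For anyone who wants to script against the two URLs above, here is a minimal Python sketch. It only builds the HTTP requests; sending them is a separate step. Several things in it are assumptions rather than anything stated in this mail: that the chain accepts plain text via POST and can answer with 'application/rdf+xml', that the site exposes a 'find' sub-resource taking 'name' and 'limit' form parameters, and the helper names themselves (build_enhance_request, build_find_request, send), which are made up for illustration.

```python
import urllib.parse
import urllib.request

# URLs taken from the mail above.
CHAIN_URL = "http://dev.iks-project.eu:8083/enhancer/chain/freebase-proper-noun"
SITE_URL = "http://dev.iks-project.eu:8083/entityhub/site/freebase/"


def build_enhance_request(text, chain_url=CHAIN_URL, accept="application/rdf+xml"):
    """Build (but do not send) a POST request submitting plain text to a chain.

    Assumption: the chain takes the text as the request body and picks the
    response format from the Accept header.
    """
    return urllib.request.Request(
        chain_url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
        method="POST",
    )


def build_find_request(name, site_url=SITE_URL, limit=5):
    """Build (but do not send) a label lookup against the Freebase site.

    Assumption: the site offers a 'find' sub-resource that accepts
    form-encoded 'name' and 'limit' parameters.
    """
    form = urllib.parse.urlencode({"name": name, "limit": limit})
    return urllib.request.Request(
        urllib.parse.urljoin(site_url, "find"),
        data=form.encode("ascii"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling send(build_enhance_request("some test text")) on one of your test files would return the enhancement results as RDF, provided the server is reachable and behaves as assumed. Keeping request construction separate from sending makes it easy to point the same code at a DBpedia chain for the comparison you mention.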

I would be happy to provide the index for download, but for now it is
not available because of its size. The 'freebase.solrindex.zip' file
is ~17 GByte, and if multiple users tried to download it at once this
could very well 'crash' the Internet connection of our company. If someone
has the infrastructure to serve indexes of that size, I am happy
to upload it there.

>
> Is [2]
> http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html
> used by any engine now in stanbol?

Not so far.

> Are there any programming APIs available to access the concept dictionary?
>

Not as part of Stanbol.

> -harish
>
>
> On Sat, Apr 27, 2013 at 7:35 AM, Rupert Westenthaler <[email protected]> wrote:
>
>> Hi Antonio
>>
>> First of all thx for your interest!
>>
>> On Thu, Apr 25, 2013 at 4:04 PM, Antonio Perez <[email protected]> wrote:
>> > Hi everybody
>> >
>> > I'm Antonio David Pérez, a new Zaizi team member and an MSc student at
>> > the University of Seville. Lately, I've been involved in the development of
>> > a semantic CMS solution in a Spanish company called Ximdex, working with
>> > several technologies like Apache Nutch, Apache Solr and also Apache Stanbol.
>> >
>> > Currently, I've been assigned to a project that involves different
>> > technologies like Apache Stanbol and Apache ManifoldCF. So, related to
>> > Stanbol, I'm interested in the disambiguation problem, so I would like to
>> > prepare a proposal for GSoC about this topic.
>> >
>>
>> If you already have some experience with Apache Stanbol, this
>> would for sure be a big help for a GSoC project.
>>
>> > I have been following the latest mails about disambiguation and the WebID
>> > protocol. I would be more interested in developing disambiguation systems
>> > within Stanbol using the major semantic knowledge bases. Actually, my
>> > initial idea is to use Freebase with the aim to make it extensible to any
>> > other database like Wikipedia and DBpedia. Following STANBOL-1037 [1], the
>> > main goal is to implement a couple of global-approach disambiguation
>> > algorithms to be used in Stanbol.
>> >
>>
>> Disambiguation on "World Domain" datasets is a very important feature
>> for a lot of usage scenarios. So definitely very interesting and
>> relevant for Apache Stanbol.
>>
>> > For this, I would like to discuss some topics about the proposal:
>> >
>> > - Knowledge Base: I have decided to stick first to Freebase, because it has
>> > a REST API allowing 100k calls per day for read and 10k for write. Besides
>> > the REST API, an alternative could be to integrate the whole Freebase graph
>> > in Stanbol and use their Java API to manage it. Ideally, the management
>> > framework should be valid for other knowledge bases such as Wikipedia or
>> > DBpedia.
>> >
>>
>> I recently created my first Freebase index for Stanbol (see
>> STANBOL-1014 for the Indexing tool). First tests on an index with all
>> Freebase Topics and all languages have shown very nice results! IMO
>> Freebase is currently for sure the better choice over DBpedia. However,
>> one needs to wait and see how Freebase compares to the Wikidata project
>> [4], which only recently entered phase 2.
>>
>> Designing disambiguation in a way that it can be applied to other
>> datasets would be for sure a great bonus. But given the good results
>> one can get with Freebase I would even be very interested if the
>> results would only work on Freebase ^^
>>
>> > - Resources: As has been pointed out before on the mailing lists, Google has
>> > released a couple of resources to be used in disambiguation applications.
>> > One is a dictionary of concepts from Wikipedia, using anchor text labels in
>> > Wikipedia internal links to create an index of entities' possible names [2].
>> > The second one is a dataset of texts that link to concepts in
>> > Wikipedia [3] that can be used as disambiguation contexts according to
>> > STANBOL-1037. I need to research if similar information can be retrieved
>> > directly from Freebase or, in other words, to check if this information is
>> > already incorporated in Freebase.
>> >
>>
>> I think you can even use [2] and [3] for disambiguation on top of
>> Freebase, as there is anyway a mapping between Freebase and DBpedia
>> concepts. However, you will likely need a higher quality mapping than
>> is currently available. Because of that I would suggest you start
>> off with implementing STANBOL-1046 [5]. For possible names (or surface
>> forms, as they are also often called) one can use the Alias in
>> Freebase. However, AFAIK there is no information available in Freebase
>> similar to [3]. Related to this I found, however, an interesting paper
>> [6]. The semi-supervised approach suggested in chapter III could
>> work nicely, especially if one considers that users could manually
>> disambiguate Entities. In combination with other mentions extracted by
>> the Stanbol Enhancer, this could be used to acquire the required data.
>>
>> > Moreover, the proposal design will try to be as generic as possible in
>> > order to be adaptable to any other Knowledge Base.
>> >
>>
>> Disambiguation is not something easy and making something "generic"
>> makes it even harder. So IMO having one/several more specific options
>> would not hurt a GSoC proposal. It would also make it easier to
>> evaluate the proposal.
>>
>> > Waiting for your comments and valuable suggestions.
>> >
>>
>> Hope my comments provided at least some valuable information.
>>
>> best
>> Rupert
>>
>> References:
>>
>> > [1] https://issues.apache.org/jira/browse/STANBOL-1037
>> > [2] http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html
>> > [3] https://code.google.com/p/wiki-links/
>> [4] https://www.wikidata.org/wiki/Wikidata:Main_Page
>> [5] https://issues.apache.org/jira/browse/STANBOL-1046
>> [6] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/38389.pdf
>>
>> >
>> > --
>> >
>> > ------------------------------
>> > This message should be regarded as confidential. If you have received this
>> > email in error please notify the sender and destroy it immediately.
>> > Statements of intent shall only become binding when confirmed in hard copy
>> > by an authorised signatory.
>> >
>> > Zaizi Ltd is registered in England and Wales with the registration number
>> > 6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam Road,
>> > London W10 5JJ, UK.
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
>
>
> --
> Thanks
> Harish



--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
