Rupert,

Is the Freebase index available for use? Where can I get it? I would like to compare the entities I get from DBpedia and Freebase on some of my test files.
Is [2] (http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html) used by any engine in Stanbol now? Are there any programming APIs available to access the concept dictionary?

-harish

On Sat, Apr 27, 2013 at 7:35 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:

> Hi Antonio
>
> First of all, thanks for your interest!
>
> On Thu, Apr 25, 2013 at 4:04 PM, Antonio Perez <ape...@zaizi.com> wrote:
> > Hi everybody
> >
> > I'm Antonio David Pérez, a new Zaizi team member and an MSc student at
> > the University of Seville. Recently, I have been involved in the
> > development of a semantic CMS solution at a Spanish company called
> > Ximdex, working with several technologies such as Apache Nutch,
> > Apache Solr and Apache Stanbol.
> >
> > Currently, I am assigned to a project that involves technologies
> > such as Apache Stanbol and Apache ManifoldCF. Regarding Stanbol, I'm
> > interested in the disambiguation problem, so I would like to prepare
> > a GSoC proposal on this topic.
>
> If you already have some experience with Apache Stanbol, this
> would for sure be a big help for a GSoC project.
>
> > I have been following the recent mails about disambiguation and the
> > WebID protocol. I would be most interested in developing
> > disambiguation systems within Stanbol using the major semantic
> > knowledge bases. My initial idea is to use Freebase, with the aim of
> > making it extensible to any other database such as Wikipedia and
> > DBpedia. Following STANBOL-1037 [1], the main goal is to implement a
> > couple of global-approach disambiguation algorithms to be used in
> > Stanbol.
>
> Disambiguation on "World Domain" datasets is a very important feature
> for a lot of usage scenarios. So definitely very interesting and
> relevant for Apache Stanbol.
>
> > For this, I would like to discuss some topics about the proposal:
> >
> > - Knowledge Base: I have decided to stick to Freebase first, because
> >   it has a REST API allowing 100k calls per day for reads and 10k for
> >   writes. Besides the REST API, an alternative could be to integrate
> >   the whole Freebase graph into Stanbol and use their Java API to
> >   manage it. Ideally, the management framework should also be valid
> >   for other knowledge bases such as Wikipedia or DBpedia.
>
> I recently created my first Freebase index for Stanbol (see
> STANBOL-1014 for the indexing tool). A first test on an index with all
> Freebase Topics and all languages has shown very nice results! IMO
> Freebase is currently for sure the better choice over DBpedia. However,
> one needs to wait and see how Freebase compares to the Wikidata project
> [4], which only recently entered phase 2.
>
> Designing disambiguation in a way that it can be applied to other
> datasets would for sure be a great bonus. But given the good results
> one can get with Freebase, I would be very interested even if the
> results only worked on Freebase ^^
>
> > - Resources: As has been pointed out before on the mailing lists,
> >   Google has released a couple of resources to be used in
> >   disambiguation applications. One is a dictionary of concepts from
> >   Wikipedia, using anchor text labels in Wikipedia internal links to
> >   create an index of possible entity names [2]. The second one is a
> >   dataset of texts that link to concepts in Wikipedia [3], which can
> >   be used as disambiguation contexts according to STANBOL-1037. I
> >   need to research whether similar information can be retrieved
> >   directly from Freebase or, in other words, to check whether this
> >   information is already incorporated in Freebase.
>
> I think you can even use [2] and [3] for disambiguation on top of
> Freebase, as there is a mapping between Freebase and DBpedia
> concepts anyway.
> However, you will likely need a higher-quality mapping than is
> currently available. Because of that, I would suggest that you start
> by implementing STANBOL-1046 [5]. For possible names (or surface
> forms, as they are also often called) one can use the aliases in
> Freebase. However, AFAIK there is no information available in Freebase
> similar to [3]. Related to this, however, I found an interesting paper
> [6]. The semi-supervised approach suggested in chapter III could
> work nicely, especially if one considers that users could manually
> disambiguate entities. In combination with other mentions extracted by
> the Stanbol Enhancer, this could be used to acquire the required data.
>
> > Moreover, the proposal design will try to be as generic as possible
> > in order to be adaptable to any other knowledge base.
>
> Disambiguation is not easy, and making something "generic"
> makes it even harder. So IMO having one or several more specific
> options would not hurt a GSoC proposal. It would also make it easier
> to evaluate the proposal.
>
> > Waiting for your comments and valuable suggestions.
>
> Hope my comments provided at least some valuable information.
>
> best
> Rupert
>
> References:
>
> > [1] https://issues.apache.org/jira/browse/STANBOL-1037
> > [2] http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html
> > [3] https://code.google.com/p/wiki-links/
> [4] https://www.wikidata.org/wiki/Wikidata:Main_Page
> [5] https://issues.apache.org/jira/browse/STANBOL-1046
> [6] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/38389.pdf
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen

--
Thanks
Harish