Hi Antonio,

First of all, thanks for your interest!

On Thu, Apr 25, 2013 at 4:04 PM, Antonio Perez <ape...@zaizi.com> wrote:
> Hi everybody
>
> I'm Antonio David Pérez, a new Zaizi team member and an MSc student at
> the University of Seville. Lately, I've been involved in the development of
> a semantic CMS solution at a Spanish company called Ximdex, working with
> several technologies like Apache Nutch, Apache Solr and also Apache Stanbol.
>
> Currently, I've been assigned to a project that involves different
> technologies like Apache Stanbol and Apache ManifoldCF. Regarding
> Stanbol, I'm interested in the disambiguation problem, so I would like to
> prepare a proposal for GSoC about this topic.
>

If you already have some experience with Apache Stanbol, that will for sure be a big help for a GSoC project.

> I have been following the latest mails about disambiguation and the WebID
> protocol. I would be more interested in developing disambiguation systems
> within Stanbol using the major semantic knowledge bases. Actually, my initial
> idea is to use Freebase, with the aim of making it extensible to any other
> database like Wikipedia and DBpedia. Following STANBOL-1037 [1], the main
> goal is to implement a couple of global-approach disambiguation algorithms
> to be used in Stanbol.
>

Disambiguation on "World Domain" datasets is a very important feature for a lot of usage scenarios. So this is definitely very interesting and relevant for Apache Stanbol.

> For this, I would like to discuss some topics about the proposal:
>
> - Knowledge Base: I have decided to stick first to Freebase, because it has
> a REST API allowing 100k calls per day for read and 10k for write. Besides
> the REST API, an alternative could be to integrate the whole Freebase graph
> in Stanbol and use their Java API to manage it. Ideally, the management
> framework should be valid for other knowledge bases such as Wikipedia or
> DBpedia.
>

I recently created my first Freebase index for Stanbol (see STANBOL-1014 for the indexing tool).
First tests on an index with all Freebase topics and all languages have shown very nice results! IMO Freebase is currently for sure the better choice over DBpedia. However, one needs to wait and see how Freebase compares to the Wikidata project [4], which only recently entered phase 2.

Designing disambiguation in a way that it can be applied to other datasets would for sure be a great bonus. But given the good results one can get with Freebase, I would be very interested even if the results only worked on Freebase ^^

> - Resources: As has been pointed out before on the mailing lists, Google has
> released a couple of resources to be used in disambiguation applications.
> One is a dictionary of concepts from Wikipedia, using anchor text labels in
> Wikipedia internal links to create an index of possible entity names [2].
> The second one is a dataset of texts that link to concepts in
> Wikipedia [3] that can be used as disambiguation contexts according to
> STANBOL-1037. I need to research whether similar information can be retrieved
> directly from Freebase or, in other words, to check if this information is
> already incorporated in Freebase.
>

I think you can even use [2] and [3] for disambiguation on top of Freebase, as there is a mapping between Freebase and DBpedia concepts anyway. However, you will likely need a higher-quality mapping than is currently available. Because of that, I would suggest you start off with implementing STANBOL-1046 [5].

For possible names (or surface forms, as they are also often called) one can use the aliases in Freebase. However, AFAIK no information similar to [3] is available in Freebase. Related to this, I found an interesting paper [6]. The semi-supervised approach suggested in chapter III could work nicely, especially if one considers that users could manually disambiguate entities. In combination with other mentions extracted by the Stanbol Enhancer, this could be used to acquire the required data.
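
To make the idea of combining the two sources a bit more concrete, here is a rough sketch in Python. It is not Stanbol code, and all identifiers, mappings and counts below are made up for illustration. It assumes that Freebase aliases provide the candidate entities for a mention, that a Freebase-to-Wikipedia mapping exists, and that an anchor-text dictionary like [2] provides link counts per (surface form, Wikipedia page). It only ranks candidates by how often a surface form links to each of them (a simple local prior), so it is not a global disambiguation approach by itself:

```python
from collections import defaultdict

# Hypothetical sample data. Real values would come from a Freebase index
# and from an anchor-text dictionary such as the one described in [2].
FREEBASE_ALIASES = {
    "fb:paris_city": ["Paris", "City of Light"],
    "fb:paris_hilton": ["Paris Hilton", "Paris"],
}

# Hypothetical Freebase-to-Wikipedia mapping (made-up identifiers).
FREEBASE_TO_WIKIPEDIA = {
    "fb:paris_city": "Paris",
    "fb:paris_hilton": "Paris_Hilton",
}

# Hypothetical anchor-text counts: (surface form, Wikipedia page) -> links.
ANCHOR_COUNTS = {
    ("paris", "Paris"): 900,
    ("paris", "Paris_Hilton"): 100,
    ("paris hilton", "Paris_Hilton"): 50,
}


def build_surface_form_index(aliases):
    """Map each lower-cased alias to the set of entities that carry it."""
    index = defaultdict(set)
    for entity, names in aliases.items():
        for name in names:
            index[name.lower()].add(entity)
    return index


def rank_candidates(mention, index, mapping, anchor_counts):
    """Rank candidate entities for a mention by relative anchor-text frequency."""
    key = mention.lower()
    scored = []
    for entity in index.get(key, set()):
        page = mapping.get(entity)
        scored.append((entity, anchor_counts.get((key, page), 0)))
    total = sum(count for _, count in scored)
    ranked = [(entity, count / total if total else 0.0) for entity, count in scored]
    # Highest score first; ties are broken by entity id for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    index = build_surface_form_index(FREEBASE_ALIASES)
    for entity, score in rank_candidates(
        "Paris", index, FREEBASE_TO_WIKIPEDIA, ANCHOR_COUNTS
    ):
        print(entity, round(score, 2))
```

A mention that only has Freebase aliases but no anchor-text counts gets a score of 0.0 for every candidate here, which is exactly the case where a better mapping or the context data from [3] would be needed.
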

> Moreover, the proposal design will try to be as generic as possible in
> order to be adaptable to any other Knowledge Base.
>

Disambiguation is not easy, and making something "generic" makes it even harder. So IMO having one or several more specific options would not hurt a GSoC proposal. It would also make the proposal easier to evaluate.

> Waiting for your comments and valuable suggestions.
>

I hope my comments provided at least some valuable information.

best
Rupert

References:
> [1] https://issues.apache.org/jira/browse/STANBOL-1037
> [2] http://googleresearch.blogspot.com.es/2012/05/from-words-to-concepts-and-back.html
> [3] https://code.google.com/p/wiki-links/
[4] https://www.wikidata.org/wiki/Wikidata:Main_Page
[5] https://issues.apache.org/jira/browse/STANBOL-1046
[6] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/38389.pdf

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen