Hi Rupert,

The "spatial" dimension is a good idea. I'll also take a look at Yago.
I will create a Jira with what we talked about here. For now it will
probably have just a draft-like description, which I will update as I go
along.

Thanks,
Cristian

2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:

> Hi Cristian,
>
> definitely an interesting approach. You should have a look at Yago2
> [1]. As far as I can remember, the Yago taxonomy is much better
> structured than the one used by dbpedia. Mapping suggestions of dbpedia
> to concepts in Yago2 is easy, as both dbpedia and yago2 provide
> mappings [2] and [3].
>
> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >>
> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
> >> huge profit".
>
> That's actually a very good example. Spatial contexts are very
> important, as they are often used for referencing. So I would
> suggest treating the spatial context specially. For spatial Entities
> (like a City) this is easy, but even for others (like a Person or a
> Company) you could use relations to spatial entities to define their
> spatial context. This context could then be used to correctly link
> "The Redmond's company" to "Microsoft".
>
> In addition, I would suggest using the "spatial" context of each
> entity (basically its relations to entities that are cities, regions,
> or countries) as a separate dimension, because those are very often
> used for coreferences.
>
> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>
> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> <cristian.petro...@gmail.com> wrote:
> > There are several dbpedia categories for each entity. In this case, for
> > Microsoft we have:
> >
> > category:Companies_in_the_NASDAQ-100_Index
> > category:Microsoft
> > category:Software_companies_of_the_United_States
> > category:Software_companies_based_in_Washington_(state)
> > category:Companies_established_in_1975
> > category:1975_establishments_in_the_United_States
> > category:Companies_based_in_Redmond,_Washington
> > category:Multinational_companies_headquartered_in_the_United_States
> > category:Cloud_computing_providers
> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >
> > So we also have "Companies based in Redmond, Washington", which could be
> > matched.
> >
> > There is still other contextual information from dbpedia which can be
> > used. For example, for an Organization we could also include:
> > dbpprop:industry = Software
> > dbpprop:service = Online Service Providers
> >
> > and for a Person (this is for Barack Obama):
> >
> > dbpedia-owl:profession:
> >     dbpedia:Author
> >     dbpedia:Constitutional_law
> >     dbpedia:Lawyer
> >     dbpedia:Community_organizing
> >
> > I'd like to continue investigating this, as I think it may have some
> > value in increasing the number of coreference resolutions. I'd like to
> > concentrate more on precision than on recall, since we already have a
> > set of coreferences detected by the stanford nlp tool and this would be
> > an addition to that (at least this is how I would like to use it).
> >
> > Is it ok if I track this by opening a jira?
> > I could update it to show my
> > progress and also my conclusions, and if it turns out that it was a bad
> > idea, at least I'll end up with more knowledge about Stanbol in the
> > end :).
> >
> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
> >
> >> Hi Cristian,
> >>
> >> The approach sounds nice. I don't want to be the devil's advocate, but
> >> I'm just not sure about the recall using the dbpedia categories feature.
> >> For example, your sentence could also be "Microsoft posted its 2013
> >> earnings. The Redmond's company made a huge profit". So, maybe including
> >> more contextual information from dbpedia could increase the recall, but
> >> of course it will reduce the precision.
> >>
> >> Cheers,
> >> Rafa
> >>
> >> On 04/02/14 09:50, Cristian Petroaca wrote:
> >>
> >>> Back with a more detailed description of the steps for making this
> >>> kind of coreference work.
> >>>
> >>> I will be using references to the following text in the steps below in
> >>> order to make things clearer: "Microsoft posted its 2013 earnings. The
> >>> software company made a huge profit."
> >>>
> >>> 1. For every noun phrase in the text which has:
> >>>    a. a determiner POS which implies a reference to an entity local to
> >>>       the text, such as "the, this, these", but not "another, every",
> >>>       etc., which imply a reference to an entity outside of the text.
> >>>    b. at least one other noun, aside from the main required noun,
> >>>       which further describes it. For example, I will not count "The
> >>>       company" as a legitimate candidate, since this could create a lot
> >>>       of false positives because of the double meaning of some words,
> >>>       as in "in the company of good people".
> >>>       "The software company" is a good candidate since we also have
> >>>       "software".
> >>>
> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
> >>>    categories of each named entity found prior to the location of the
> >>>    noun phrase in the text.
> >>>    The dbpedia categories are in the following format (for Microsoft,
> >>>    for example): "Software companies of the United States".
> >>>    So we try to match "software company" with that.
> >>>
> >>>    First, as you can see, the main noun in the dbpedia category has a
> >>>    plural form, and it's the same for all the categories I saw. I don't
> >>>    know if there's an easier way to do this, but I thought of applying
> >>>    a lemmatizer to the category and the noun phrase so that they have
> >>>    a common denominator. This also works if the noun phrase itself has
> >>>    a plural form.
> >>>
> >>>    Second, I'll need to use for comparison only the words in the
> >>>    category which are themselves nouns, and not prepositions or
> >>>    determiners such as "of the". This means that I need to POS tag the
> >>>    contents of the categories as well.
> >>>    I was thinking of running the POS tagging and lemmatization on the
> >>>    dbpedia categories when building the dbpedia backed entity hub and
> >>>    storing the results for later use. I don't know how feasible this
> >>>    is at the moment.
> >>>
> >>>    After this I can compare each noun in the noun phrase with the
> >>>    equivalent nouns in the categories and, based on the number of
> >>>    matches, create a confidence level.
> >>>
> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia of
> >>>    the named entity. If this matches, increase the confidence level.
> >>>
> >>> 4. If there are multiple named entities which can match a certain noun
> >>>    phrase, then link the noun phrase with the closest named entity
> >>>    prior to it in the text.
> >>>
> >>> What do you think?
> >>>
> >>> Cristian
> >>>
> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
> >>>
> >>>> Hi Rafa,
> >>>>
> >>>> I don't yet have a concrete heuristic, but I'm working on it.
> >>>> I'll provide
> >>>> it here so that you guys can give me feedback on it.
> >>>>
> >>>> What are "locality" features?
> >>>>
> >>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker,
> >>>> and they don't provide this kind of coreference.
> >>>>
> >>>> Cristian
> >>>>
> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
> >>>>
> >>>>> Hi Cristian,
> >>>>>
> >>>>> Without having more details about your concrete heuristic, in my honest
> >>>>> opinion such an approach could produce a lot of false positives. I don't
> >>>>> know if you are planning to use some "locality" features to detect such
> >>>>> coreferences, but you need to take into account that it is quite usual
> >>>>> for coreferenced mentions to occur even in different paragraphs.
> >>>>> Although I'm not an expert in Natural Language Understanding, I would
> >>>>> say it is quite difficult to get decent precision/recall rates for
> >>>>> coreference resolution using fixed rules. Maybe you can give other
> >>>>> tools a try, like BART (http://www.bart-coref.org/).
> >>>>>
> >>>>> Cheers,
> >>>>> Rafa Haro
> >>>>>
> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> One of the necessary steps for implementing the Event extraction Engine
> >>>>>> feature (https://issues.apache.org/jira/browse/STANBOL-1121) is to have
> >>>>>> coreference resolution in the given text. This is provided now via the
> >>>>>> stanford-nlp project, but as far as I saw this module performs mostly
> >>>>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama)
> >>>>>> coreference resolution.
> >>>>>>
> >>>>>> In order to get more coreferences from the text I thought of creating
> >>>>>> some logic that would detect this kind of coreference:
> >>>>>> "Apple reaches new profit heights. The software company just announced
> >>>>>> its 2013 earnings."
> >>>>>> Here "The software company" obviously refers to "Apple".
> >>>>>> So I'd like to detect coreferences of Named Entities which are of the
> >>>>>> rdf:type of the Named Entity, in this case "company", and also have
> >>>>>> attributes which can be found in the dbpedia categories of the named
> >>>>>> entity, in this case "software".
> >>>>>>
> >>>>>> The detection of coreferences such as "The software company" in the
> >>>>>> text would also be done either by using the new Pos Tag Based Phrase
> >>>>>> extraction Engine (noun phrases) or by using a dependency tree of the
> >>>>>> sentence and picking up only subjects or objects.
> >>>>>>
> >>>>>> At this point I'd like to know if this kind of logic would be useful
> >>>>>> as a separate Enhancement Engine (in case the precision and recall are
> >>>>>> good enough) in Stanbol?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Cristian
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
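
For reference, steps 1 to 4 from the 04/02 message quoted above could be
sketched roughly as follows. This is only an illustration in Python, not
Stanbol code, and it rests on several assumptions: small word lists stand in
for a real POS tagger, lemma() is a crude plural stripper standing in for a
real lemmatizer, the entity dictionaries (label, position, types, categories)
are a made-up structure, and the min_score threshold is arbitrary.

```python
import re

# Assumption: small word lists stand in for the POS tagger discussed in the thread.
LOCAL_DETERMINERS = {"the", "this", "these"}
NON_NOUNS = LOCAL_DETERMINERS | {"of", "in", "and", "based"}


def lemma(word):
    """Crude singularisation; a real lemmatizer would replace this."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def content_lemmas(text):
    """Lemmas of the words that are not in the determiner/preposition list."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [lemma(t) for t in tokens if len(t) > 1 and t.lower() not in NON_NOUNS]


def is_candidate(noun_phrase):
    """Step 1: a local determiner plus at least one word besides the head noun."""
    words = noun_phrase.lower().split()
    return (bool(words) and words[0] in LOCAL_DETERMINERS
            and len(content_lemmas(noun_phrase)) >= 2)


def confidence(noun_phrase, entity):
    """Steps 2 and 3: count category matches, add one for an rdf:type match."""
    np_lemmas = content_lemmas(noun_phrase)
    category_lemmas = {l for c in entity["categories"] for l in content_lemmas(c)}
    score = sum(1 for l in np_lemmas if l in category_lemmas)
    type_lemmas = {lemma(t) for t in entity["types"]}
    # The last word of the phrase is treated as the head noun (an assumption).
    if np_lemmas and np_lemmas[-1] in type_lemmas:
        score += 1
    return score


def resolve(noun_phrase, position, entities, min_score=2):
    """Step 4: among matching entities before the phrase, pick the closest one."""
    if not is_candidate(noun_phrase):
        return None
    prior = [e for e in entities
             if e["position"] < position and confidence(noun_phrase, e) >= min_score]
    return max(prior, key=lambda e: e["position"])["label"] if prior else None


# Hypothetical entity record using two of the categories listed in the thread.
microsoft = {
    "label": "Microsoft",
    "position": 0,
    "types": ["Company"],
    "categories": [
        "Software_companies_of_the_United_States",
        "Companies_based_in_Redmond,_Washington",
    ],
}

print(resolve("The software company", 36, [microsoft]))
```

With this data "The company" is rejected in step 1, while "The Redmond's
company" also matches through the "Companies based in Redmond, Washington"
category. Precomputing the category lemmas when the entity hub is built, as
suggested in the thread, would avoid recomputing them for every noun phrase.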