There are several dbpedia categories for each entity. In this case, for
Microsoft, we have:

category:Companies_in_the_NASDAQ-100_Index
category:Microsoft
category:Software_companies_of_the_United_States
category:Software_companies_based_in_Washington_(state)
category:Companies_established_in_1975
category:1975_establishments_in_the_United_States
category:Companies_based_in_Redmond,_Washington
category:Multinational_companies_headquartered_in_the_United_States
category:Cloud_computing_providers
category:Companies_in_the_Dow_Jones_Industrial_Average

So we also have "Companies based in Redmond, Washington", which could be
matched.

There is still other contextual information from dbpedia which can be
used. For example, for an Organization we could also include:

dbpprop:industry = Software
dbpprop:service = Online Service Providers

and for a Person (this one is for Barack Obama):

dbpedia-owl:profession:
    dbpedia:Author
    dbpedia:Constitutional_law
    dbpedia:Lawyer
    dbpedia:Community_organizing

I'd like to continue investigating this, as I think it may have some
value in increasing the number of coreference resolutions. I'd like to
concentrate more on precision than on recall, since we already have a set
of coreferences detected by the Stanford NLP tool and this would be an
addition to that (at least this is how I would like to use it).

Is it OK if I track this by opening a JIRA issue? I could update it to
show my progress and also my conclusions. If it turns out that it was a
bad idea, at least I'll end up with more knowledge about Stanbol in the
end :).

2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:

> Hi Cristian,
>
> The approach sounds nice. I don't want to be the devil's advocate but I'm
> just not sure about the recall using the dbpedia categories feature. For
> example, your sentence could also be "Microsoft posted its 2013 earnings.
> The Redmond's company made a huge profit". So, maybe including more
> contextual information from dbpedia could increase the recall but of
> course will reduce the precision.
>
> Cheers,
> Rafa
>
> On 04/02/14 09:50, Cristian Petroaca wrote:
>
>> Back with a more detailed description of the steps for making this kind
>> of coreference work.
>>
>> I will be using references to the following text in the steps below in
>> order to make things clearer: "Microsoft posted its 2013 earnings. The
>> software company made a huge profit."
>>
>> 1. Select as candidates every noun phrase in the text which has:
>>    a. a determiner (by POS tag) which implies a reference to an entity
>>       local to the text, such as "the, this, these", but not "another,
>>       every", etc., which imply a reference to an entity outside of the
>>       text.
>>    b. at least one other noun, aside from the main required noun, which
>>       further describes it. For example, I will not count "The company"
>>       as a legitimate candidate, since this could create a lot of false
>>       positives because of the double meaning of some words, such as
>>       "in the company of good people". "The software company" is a good
>>       candidate since we also have "software".
>>
>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
>> categories of each named entity found prior to the location of the noun
>> phrase in the text.
>> The dbpedia categories are in the following format (for Microsoft, for
>> example): "Software companies of the United States".
>> So we try to match "software company" with that.
>>
>> First, as you can see, the main noun in the dbpedia category has a
>> plural form, and it's the same for all the categories I saw. I don't
>> know if there's an easier way to do this, but I thought of applying a
>> lemmatizer to the category and the noun phrase in order for them to
>> have a common denominator. This also works if the noun phrase itself
>> has a plural form.
>>
>> Second, for the comparison I'll need to use only the words in the
>> category which are themselves nouns, and not prepositions or
>> determiners such as "of the". This means that I need to POS tag the
>> contents of the categories as well.
>> I was thinking of running the POS tagging and lemmatization on the
>> dbpedia categories when building the dbpedia-backed entity hub and
>> storing the results for later use - I don't know how feasible this is
>> at the moment.
>>
>> After this I can compare each noun in the noun phrase with the
>> equivalent nouns in the categories, and based on the number of matches
>> I can create a confidence level.
>>
>> 3. Match the main noun of the noun phrase with the rdf:type from
>> dbpedia of the named entity. If this matches, increase the confidence
>> level.
>>
>> 4. If there are multiple named entities which can match a certain noun
>> phrase, then link the noun phrase with the closest named entity prior
>> to it in the text.
>>
>> What do you think?
>>
>> Cristian
>>
>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>>
>>> Hi Rafa,
>>>
>>> I don't yet have a concrete heuristic but I'm working on it. I'll
>>> provide it here so that you guys can give me feedback on it.
>>>
>>> What are "locality" features?
>>>
>>> I looked at BART and other coref tools such as ArkRef and
>>> CherryPicker and they don't provide such a coreference.
>>>
>>> Cristian
>>>
>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>>
>>>> Hi Cristian,
>>>>
>>>> Without having more details about your concrete heuristic, in my
>>>> honest opinion, such an approach could produce a lot of false
>>>> positives. I don't know if you are planning to use some "locality"
>>>> features to detect such coreferences, but you need to take into
>>>> account that it is quite usual that coreferenced mentions can occur
>>>> even in different paragraphs. Although I'm not an expert in Natural
>>>> Language Understanding, I would say it is quite difficult to get
>>>> decent precision/recall rates for coreferencing using fixed rules.
>>>> Maybe you can give a try to other tools like BART
>>>> (http://www.bart-coref.org/).
>>>>
>>>> Cheers,
>>>> Rafa Haro
>>>>
>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> One of the necessary steps for implementing the Event extraction
>>>>> Engine feature (https://issues.apache.org/jira/browse/STANBOL-1121)
>>>>> is to have coreference resolution in the given text. This is
>>>>> provided now via the stanford-nlp project, but as far as I saw this
>>>>> module performs mostly pronominal (He, She) or nominal (Barack
>>>>> Obama and Mr. Obama) coreference resolution.
>>>>>
>>>>> In order to get more coreferences from the text I thought of
>>>>> creating some logic that would detect this kind of coreference:
>>>>> "Apple reaches new profit heights. The software company just
>>>>> announced its 2013 earnings."
>>>>> Here "The software company" obviously refers to "Apple".
>>>>> So I'd like to detect mentions which refer to a Named Entity by its
>>>>> rdf:type, in this case "company", and which also have attributes
>>>>> that can be found in the dbpedia categories of the named entity, in
>>>>> this case "software".
>>>>>
>>>>> The detection of mentions such as "The software company" in the
>>>>> text would be done either by using the new Pos Tag Based Phrase
>>>>> extraction Engine (noun phrases) or by using a dependency tree of
>>>>> the sentence and picking up only subjects or objects.
>>>>>
>>>>> At this point I'd like to know if this kind of logic would be
>>>>> useful as a separate Enhancement Engine in Stanbol (in case the
>>>>> precision and recall are good enough)?
>>>>>
>>>>> Thanks,
>>>>> Cristian
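
For illustration only, the four steps quoted above (candidate filtering,
category matching, rdf:type check, closest prior entity) could be sketched
roughly like this. It is a plain Python sketch under several assumptions and
not Stanbol code: the lemma() function and the NON_NOUNS stop list are naive
stand-ins for a real lemmatizer and POS tagger, the entity dictionaries and
the min_score threshold are hypothetical, and the "Company" type for
Microsoft is assumed for the example.

```python
import re

# Assumption: a small stop list stands in for POS tagging of category labels.
NON_NOUNS = {"of", "the", "in", "a", "an", "based", "headquartered", "established"}
# Determiners that point to an entity local to the text (step 1a).
LOCAL_DETERMINERS = {"the", "this", "these"}


def lemma(word):
    """Naive singularizer; a real lemmatizer would replace this."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def content_lemmas(label):
    """Lemmas of the (assumed) nouns in a category label or noun phrase."""
    words = re.findall(r"[A-Za-z]+", label.replace("_", " "))
    return {lemma(w) for w in words if w.lower() not in NON_NOUNS}


def is_candidate(noun_phrase):
    """Step 1: a local determiner followed by at least two more words.

    Assumption: every word after the determiner is a noun.
    """
    words = noun_phrase.lower().split()
    return len(words) >= 3 and words[0] in LOCAL_DETERMINERS


def confidence(noun_phrase, entity):
    """Steps 2 and 3: best category overlap, plus one if the head noun is a type."""
    np_lemmas = content_lemmas(noun_phrase)
    best = max((len(np_lemmas & content_lemmas(c)) for c in entity["categories"]),
               default=0)
    head = lemma(noun_phrase.split()[-1])
    types = {lemma(t) for t in entity["types"]}
    return best + (1 if head in types else 0)


def resolve(noun_phrase, np_offset, entities, min_score=2):
    """Step 4: among matching entities before the phrase, pick the closest one."""
    if not is_candidate(noun_phrase):
        return None
    matches = [e for e in entities
               if e["offset"] < np_offset
               and confidence(noun_phrase, e) >= min_score]
    if not matches:
        return None
    return max(matches, key=lambda e: e["offset"])["name"]


TEXT = "Microsoft posted its 2013 earnings. The software company made a huge profit."
MICROSOFT = {
    "name": "Microsoft",
    "offset": TEXT.index("Microsoft"),
    "categories": [
        "Software_companies_of_the_United_States",
        "Companies_based_in_Redmond,_Washington",
        "Cloud_computing_providers",
    ],
    "types": ["Company"],  # assumed rdf:type, for the example only
}

if __name__ == "__main__":
    phrase = "The software company"
    print(resolve(phrase, TEXT.index(phrase), [MICROSOFT]))  # prints Microsoft
```

With a threshold of 2 the phrase has to match a category on more than the
head noun alone, which leans towards precision over recall. Rafa's "The
Redmond's company" example would only score on the head noun and the type
here, so it would sit right at the threshold.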