Opened https://issues.apache.org/jira/browse/STANBOL-1279
2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:

> Hi Rupert,
>
> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>
> I will create a Jira with what we talked about here. It will probably have
> just a draft-like description for now and will be updated as I go along.
>
> Thanks,
> Cristian
>
>
> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> rupert.westentha...@gmail.com>:
>
>> Hi Cristian,
>>
>> definitely an interesting approach. You should have a look at Yago2
>> [1]. As far as I can remember the Yago taxonomy is much better
>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>> mappings [2] and [3]
>>
>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >>
>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> >> huge profit".
>>
>> That's actually a very good example. Spatial contexts are very
>> important as they tend to be often used for referencing. So I would
>> suggest to specially treat the spatial context. For spatial Entities
>> (like a City) this is easy, but even for others (like a Person,
>> Company) you could use relations to spatial entities to define their
>> spatial context. This context could then be used to correctly link
>> "The Redmond's company" to "Microsoft".
>>
>> In addition I would suggest to use the "spatial" context of each
>> entity (basically relation to entities that are cities, regions,
>> countries) as a separate dimension, because those are very often used
>> for coreferences.
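Inline note: the spatial-context idea above could be sketched roughly as
follows. This is only a minimal illustration, not Stanbol code. The entity
dictionaries, the "spatial" field and the function names are hypothetical,
and the sample data is hand-written rather than read from dbpedia or Yago2.

```python
def normalize(token):
    """Lowercase a token and drop a trailing possessive ('s)."""
    token = token.lower()
    return token[:-2] if token.endswith("'s") else token


def link_by_spatial_context(noun_phrase, entities):
    """Return labels of entities whose type matches the head noun and whose
    spatial context (cities, regions, countries) contains a modifier."""
    tokens = [normalize(t) for t in noun_phrase.split()]
    head, modifiers = tokens[-1], set(tokens[:-1])
    return [
        e["label"]
        for e in entities
        if head in {t.lower() for t in e["types"]}
        and modifiers & {place.lower() for place in e["spatial"]}
    ]


# Hand-written sample data (assumption, for illustration only).
entities = [
    {"label": "Microsoft", "types": ["Company"],
     "spatial": ["Redmond", "Washington", "United States"]},
    {"label": "Apple", "types": ["Company"],
     "spatial": ["Cupertino", "California", "United States"]},
]

print(link_by_spatial_context("The Redmond's company", entities))
```

Keeping the spatial relations in their own field, separate from the
categories, is what makes it possible to treat them as a separate dimension
when scoring.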
>>
>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>
>>
>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> <cristian.petro...@gmail.com> wrote:
>> > There are several dbpedia categories for each entity, in this case for
>> > Microsoft we have :
>> >
>> > category:Companies_in_the_NASDAQ-100_Index
>> > category:Microsoft
>> > category:Software_companies_of_the_United_States
>> > category:Software_companies_based_in_Washington_(state)
>> > category:Companies_established_in_1975
>> > category:1975_establishments_in_the_United_States
>> > category:Companies_based_in_Redmond,_Washington
>> > category:Multinational_companies_headquartered_in_the_United_States
>> > category:Cloud_computing_providers
>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >
>> > So we also have "Companies based in Redmond, Washington" which could be
>> > matched.
>> >
>> >
>> > There is still other contextual information from dbpedia which can be
>> > used. For example for an Organization we could also include :
>> > dbpprop:industry = Software
>> > dbpprop:service = Online Service Providers
>> >
>> > and for a Person (that's for Barack Obama) :
>> >
>> > dbpedia-owl:profession:
>> >    dbpedia:Author
>> >    dbpedia:Constitutional_law
>> >    dbpedia:Lawyer
>> >    dbpedia:Community_organizing
>> >
>> > I'd like to continue investigating this as I think that it may have some
>> > value in increasing the number of coreference resolutions. I'd like to
>> > concentrate more on precision rather than recall, since we already have a
>> > set of coreferences detected by the stanford nlp tool and this would be
>> > an addition to that (at least this is how I would like to use it).
>> >
>> > Is it ok if I track this by opening a jira?
>> > I could update it to show my progress and also my conclusions, and if it
>> > turns out that it was a bad idea then at least I'll end up with more
>> > knowledge about Stanbol in the end :).
>> >
>> >
>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>> >
>> >> Hi Cristian,
>> >>
>> >> The approach sounds nice. I don't want to be the devil's advocate but
>> >> I'm just not sure about the recall using the dbpedia categories feature.
>> >> For example, your sentence could be also "Microsoft posted its 2013
>> >> earnings. The Redmond's company made a huge profit". So, maybe including
>> >> more contextual information from dbpedia could increase the recall but
>> >> of course will reduce the precision.
>> >>
>> >> Cheers,
>> >> Rafa
>> >>
>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
>> >>
>> >>> Back with a more detailed description of the steps for making this
>> >>> kind of coreference work.
>> >>>
>> >>> I will be using references to the following text in the steps below in
>> >>> order to make things clearer : "Microsoft posted its 2013 earnings.
>> >>> The software company made a huge profit."
>> >>>
>> >>> 1. For every noun phrase in the text which has :
>> >>>    a. a determiner POS which implies reference to an entity local to
>> >>>    the text, such as "the, this, these" (but not "another, every", etc.,
>> >>>    which imply a reference to an entity outside of the text).
>> >>>    b. at least one other noun aside from the main required noun which
>> >>>    further describes it. For example I will not count "The company" as
>> >>>    being a legitimate candidate since this could create a lot of false
>> >>>    positives by considering the double meaning of some words such as
>> >>>    "in the company of good people".
>> >>>    "The software company" is a good candidate since we also have
>> >>>    "software".
>> >>>
>> >>> 2.
>> >>> match the nouns in the noun phrase to the contents of the dbpedia
>> >>> categories of each named entity found prior to the location of the
>> >>> noun phrase in the text.
>> >>> The dbpedia categories are in the following format (for Microsoft for
>> >>> example) : "Software companies of the United States".
>> >>> So we try to match "software company" with that.
>> >>> First, as you can see, the main noun in the dbpedia category has a
>> >>> plural form and it's the same for all categories which I saw. I don't
>> >>> know if there's an easier way to do this but I thought of applying a
>> >>> lemmatizer on the category and the noun phrase in order for them to
>> >>> have a common denominator. This also works if the noun phrase itself
>> >>> has a plural form.
>> >>>
>> >>> Second, I'll need to use for comparison only the words in the category
>> >>> which are themselves nouns and not prepositions or determiners such as
>> >>> "of the". This means that I need to POS-tag the categories' contents
>> >>> as well.
>> >>> I was thinking of running the POS tagging and lemmatization on the
>> >>> dbpedia categories when building the dbpedia backed entity hub and
>> >>> storing them for later use - I don't know how feasible this is at the
>> >>> moment.
>> >>>
>> >>> After this I can compare each noun in the noun phrase with the
>> >>> equivalent nouns in the categories and based on the number of matches
>> >>> I can create a confidence level.
>> >>>
>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia of
>> >>> the named entity. If this matches, increase the confidence level.
>> >>>
>> >>> 4. If there are multiple named entities which can match a certain noun
>> >>> phrase then link the noun phrase with the closest named entity prior
>> >>> to it in the text.
>> >>>
>> >>> What do you think?
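Inline note: steps 1 to 4 above could be sketched roughly like this. It is a
minimal illustration under stated assumptions, not the engine itself. The
naive lemma() function stands in for a real lemmatizer, the STOPWORDS set
stands in for POS tagging of the categories, and the entity dictionaries,
the 0.5 type bonus and the threshold are all hypothetical choices.

```python
import re

# Assumption: a stopword list replaces real POS tagging of category words.
STOPWORDS = {"of", "the", "in", "a", "an", "and", "this", "these", "based"}
# Determiners that point to an entity local to the text (step 1a).
LOCAL_DETERMINERS = {"the", "this", "these"}


def lemma(word):
    """Very naive singularization, standing in for a real lemmatizer."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def content_lemmas(text):
    """Lemmas of the words in a phrase or category, minus stopwords."""
    words = re.findall(r"[A-Za-z0-9]+", text.replace("_", " "))
    return {lemma(w) for w in words if w.lower() not in STOPWORDS}


def is_candidate(noun_phrase):
    """Step 1: local determiner plus at least one word besides the head."""
    tokens = noun_phrase.split()
    return len(tokens) >= 3 and tokens[0].lower() in LOCAL_DETERMINERS


def score(noun_phrase, entity):
    """Steps 2 and 3: best category overlap, boosted on an rdf:type match."""
    np_lemmas = content_lemmas(noun_phrase)
    if not np_lemmas:
        return 0.0
    best = max(
        (len(np_lemmas & content_lemmas(c)) / len(np_lemmas)
         for c in entity["categories"]),
        default=0.0,
    )
    head = lemma(noun_phrase.split()[-1])
    if head in {lemma(t) for t in entity["types"]}:
        best += 0.5  # arbitrary bonus, chosen only for this sketch
    return best


def resolve(noun_phrase, np_offset, entities, threshold=1.0):
    """Step 4: of the matching entities before the phrase, take the closest."""
    if not is_candidate(noun_phrase):
        return None
    matches = [e for e in entities
               if e["offset"] < np_offset
               and score(noun_phrase, e) >= threshold]
    return max(matches, key=lambda e: e["offset"])["label"] if matches else None


entities = [{
    "label": "Microsoft", "offset": 0, "types": ["Company"],
    "categories": ["Software_companies_of_the_United_States",
                   "Companies_established_in_1975"],
}]
print(resolve("The software company", 36, entities))  # Microsoft
print(resolve("The company", 36, entities))           # None
```

Precomputing content_lemmas() for every category when the entity hub is
built, as suggested above, would keep the per-document work down to the
noun phrases.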
>> >>>
>> >>> Cristian
>> >>>
>> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>> >>>
>> >>>> Hi Rafa,
>> >>>>
>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
>> >>>> provide it here so that you guys can give me feedback on it.
>> >>>>
>> >>>> What are "locality" features?
>> >>>>
>> >>>> I looked at Bart and other coref tools such as ArkRef and
>> >>>> CherryPicker and they don't provide such a coreference.
>> >>>>
>> >>>> Cristian
>> >>>>
>> >>>>
>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >>>>
>> >>>>> Hi Cristian,
>> >>>>>
>> >>>>> Without having more details about your concrete heuristic, in my
>> >>>>> honest opinion, such an approach could produce a lot of false
>> >>>>> positives. I don't know if you are planning to use some "locality"
>> >>>>> features to detect such coreferences but you need to take into
>> >>>>> account that it is quite usual that coreferenced mentions can occur
>> >>>>> even in different paragraphs. Although I'm not an expert in Natural
>> >>>>> Language Understanding, I would say it is quite difficult to get
>> >>>>> decent precision/recall rates for coreferencing using fixed rules.
>> >>>>> Maybe you can give a try to other tools like BART (
>> >>>>> http://www.bart-coref.org/).
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Rafa Haro
>> >>>>>
>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> One of the necessary steps for implementing the Event extraction
>> >>>>>> Engine feature : https://issues.apache.org/jira/browse/STANBOL-1121
>> >>>>>> is to have coreference resolution in the given text. This is
>> >>>>>> provided now via the stanford-nlp project but as far as I saw this
>> >>>>>> module is performing mostly pronominal (He, She) or nominal (Barack
>> >>>>>> Obama and Mr. Obama) coreference resolution.
>> >>>>>>
>> >>>>>> In order to get more coreferences from the text I thought of
>> >>>>>> creating some logic that would detect this kind of coreference :
>> >>>>>> "Apple reaches new profit heights. The software company just
>> >>>>>> announced its 2013 earnings."
>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
>> >>>>>> the rdf:type of the Named Entity, in this case "company", and also
>> >>>>>> have attributes which can be found in the dbpedia categories of the
>> >>>>>> named entity, in this case "software".
>> >>>>>>
>> >>>>>> The detection of coreferences such as "The software company" in the
>> >>>>>> text would also be done by either using the new Pos Tag Based
>> >>>>>> Phrase extraction Engine (noun phrases) or by using a dependency
>> >>>>>> tree of the sentence and picking up only subjects or objects.
>> >>>>>>
>> >>>>>> At this point I'd like to know if this kind of logic would be
>> >>>>>> useful as a separate Enhancement Engine (in case the precision and
>> >>>>>> recall are good enough) in Stanbol?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Cristian
>> >>>>>>
>>
>> --
>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11             ++43-699-11108907
>> | A-5500 Bischofshofen
>