Hi Cristian,

Definitely an interesting approach. You should have a look at Yago2
[1]. As far as I can remember, the Yago taxonomy is much better
structured than the one used by dbpedia. Mapping dbpedia suggestions
to concepts in Yago2 is easy, as both dbpedia and Yago2 provide
mappings ([2] and [3]).
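
To illustrate, here is a minimal Python sketch of how such a mapping
file could be loaded. I am assuming that the links are plain N-Triples
statements using owl:sameAs with dbpedia as the subject; I have not
checked the exact layout of [2] and [3], and the sample line below is
made up for illustration. The names load_links, TRIPLE and SAME_AS are
my own.

```python
import re

# Matches a simple N-Triples statement whose subject, predicate and
# object are all IRIs (literals and blank nodes are ignored).
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

# Assumption: the link files use owl:sameAs as the linking predicate.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def load_links(lines):
    """Build a {dbpedia_uri: yago_uri} dict from owl:sameAs triples."""
    links = {}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == SAME_AS:
            links[match.group(1)] = match.group(3)
    return links


# Hypothetical sample input, not copied from the real dump.
sample = [
    "<http://dbpedia.org/resource/Microsoft> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://yago-knowledge.org/resource/Microsoft> .",
    "# lines that are not sameAs triples are skipped",
]
links = load_links(sample)
```

With such a dict, a dbpedia suggestion can be translated to its Yago2
concept with a single lookup.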

> 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>
>> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> huge profit".

That's actually a very good example. Spatial contexts are very
important as they are often used for referencing. So I would
suggest treating the spatial context specially. For spatial Entities
(like a City) this is easy, but even for others (like a Person or
Company) you could use relations to spatial entities to define their
spatial context. This context could then be used to correctly link
"The Redmond's company" to "Microsoft".

In addition, I would suggest using the "spatial" context of each
entity (basically its relations to entities that are cities, regions,
countries) as a separate dimension, because those are very often used
for coreferences.
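
A rough Python sketch of what I mean, under several assumptions: the
spatial context and the type of each entity are already available as
simple lookups (the two dicts below are hypothetical toy data, not
actual dbpedia output), and a noun phrase matches when it contains both
the entity type and one single-word place name from the entity's
spatial context. The function name is my own.

```python
# Hypothetical toy data: entity -> labels of related spatial entities.
SPATIAL_CONTEXT = {
    "Microsoft": {"Redmond", "Washington", "United States"},
    "Apple": {"Cupertino", "California", "United States"},
}
# Hypothetical toy data: entity -> type label.
ENTITY_TYPE = {"Microsoft": "company", "Apple": "company"}


def link_by_spatial_context(noun_phrase, preceding_entities):
    """Return the closest preceding entity whose type and spatial
    context both occur in the noun phrase, or None if there is none."""
    words = [w.strip(".,").lower() for w in noun_phrase.split()]
    # Drop a trailing possessive so "Redmond's" becomes "redmond".
    tokens = {w[:-2] if w.endswith("'s") else w for w in words}
    # Walk backwards so the mention closest to the noun phrase wins.
    for entity in reversed(preceding_entities):
        places = {p.lower() for p in SPATIAL_CONTEXT.get(entity, ())}
        if ENTITY_TYPE.get(entity) in tokens and tokens & places:
            return entity
    return None


result = link_by_spatial_context("The Redmond's company",
                                 ["Microsoft", "Apple"])
```

Here "Apple" is the closer mention but has no matching place, so the
phrase is linked to "Microsoft". Multi-word places such as "United
States" are not matched by this token-based sketch.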

[1] http://www.mpi-inf.mpg.de/yago-naga/yago/
[2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
[3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z


On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
> There are several dbpedia categories for each entity, in this case for
> Microsoft we have :
>
> category:Companies_in_the_NASDAQ-100_Index
> category:Microsoft
> category:Software_companies_of_the_United_States
> category:Software_companies_based_in_Washington_(state)
> category:Companies_established_in_1975
> category:1975_establishments_in_the_United_States
> category:Companies_based_in_Redmond,_Washington
> category:Multinational_companies_headquartered_in_the_United_States
> category:Cloud_computing_providers
> category:Companies_in_the_Dow_Jones_Industrial_Average
>
> So we also have "Companies based in Redmond, Washington" which could be
> matched.
>
>
> There is still other contextual information from dbpedia which can be used.
> For example for an Organization we could also include :
> dbpprop:industry = Software
> dbpprop:service = Online Service Providers
>
> and for a Person (that's for Barack Obama) :
>
> dbpedia-owl:profession:
>                                dbpedia:Author
>                                dbpedia:Constitutional_law
>                                dbpedia:Lawyer
>                                dbpedia:Community_organizing
>
> I'd like to continue investigating this as I think that it may have some
> value in increasing the number of coreference resolutions and I'd like to
> concentrate more on precision rather than recall since we already have a
> set of coreferences detected by the stanford nlp tool and this would be as
> an addition to that (at least this is how I would like to use it).
>
> Is it ok if I track this by opening a jira? I could update it to show my
> progress and also my conclusions, and if it turns out that it was a bad
> idea, then so be it; at least I'll end up with more knowledge about
> Stanbol in the end :).
>
>
> 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>
>> Hi Cristian,
>>
>> The approach sounds nice. I don't want to be the devil's advocate but I'm
>> just not sure about the recall using the dbpedia categories feature. For
>> example, your sentence could be also "Microsoft posted its 2013 earnings.
>> The Redmond's company made a huge profit". So, maybe including more
>> contextual information from dbpedia could increase the recall but of course
>> will reduce the precision.
>>
>> Cheers,
>> Rafa
>>
>> El 04/02/14 09:50, Cristian Petroaca escribió:
>>
>>  Back with a more detailed description of the steps for making this kind of
>>> coreference work.
>>>
>>> I will be using references to the following text in the steps below in
>>> order to make things clearer : "Microsoft posted its 2013 earnings. The
>>> software company made a huge profit."
>>>
>>> 1. For every noun phrase in the text which has:
>>>      a. a determiner POS tag which implies a reference to an entity local
>>> to the text (such as "the, this, these"), but not "another, every", etc.,
>>> which imply a reference to an entity outside of the text.
>>>      b. at least one other noun, aside from the main required noun, which
>>> further describes it. For example, I will not count "The company" as a
>>> legitimate candidate since this could create a lot of false positives
>>> given the double meaning of some words, such as "in the company of
>>> good people".
>>> "The software company" is a good candidate since we also have "software".
>>>
>>> 2. match the nouns in the noun phrase to the contents of the dbpedia
>>> categories of each named entity found prior to the location of the noun
>>> phrase in the text.
>>> The dbpedia categories are in the following format (for Microsoft for
>>> example) : "Software companies of the United States".
>>>   So we try to match "software company" with that.
>>> First, as you can see, the main noun in the dbpedia category has a plural
>>> form and it's the same for all categories which I saw. I don't know if
>>> there's an easier way to do this but I thought of applying a lemmatizer on
>>> the category and the noun phrase in order for them to have a common
>>> denominator. This also works if the noun phrase itself has a plural form.
>>>
>>> Second, I'll need to use for comparison only the words in the category
>>> which are themselves nouns and not prepositions or determiners such as "of
>>> the". This means that I need to POS-tag the category contents as well.
>>> I was thinking of running the pos and lemma on the dbpedia categories when
>>> building the dbpedia backed entity hub and storing them for later use - I
>>> don't know how feasible this is at the moment.
>>>
>>> After this I can compare each noun in the noun phrase with the equivalent
>>> nouns in the categories and based on the number of matches I can create a
>>> confidence level.
>>>
>>> 3. match the noun of the noun phrase with the rdf:type from dbpedia of the
>>> named entity. If this matches increase the confidence level.
>>>
>>> 4. If there are multiple named entities which can match a certain noun
>>> phrase then link the noun phrase with the closest named entity prior to it
>>> in the text.
>>>
>>> What do you think?
>>>
>>> Cristian
>>>
>>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>>>
>>>  Hi Rafa,
>>>>
>>>> I don't yet have a concrete heuristic but I'm working on it. I'll provide
>>>> it here so that you guys can give me feedback on it.
>>>>
>>>> What are "locality" features?
>>>>
>>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker
>>>> and they don't provide such a coreference.
>>>>
>>>> Cristian
>>>>
>>>>
>>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>>>
>>>> Hi Cristian,
>>>>
>>>>> Without having more details about your concrete heuristic, in my honest
>>>>> opinion, such an approach could produce a lot of false positives. I
>>>>> don't know if you are planning to use some "locality" features to detect
>>>>> such coreferences, but you need to take into account that it is quite
>>>>> usual for coreferenced mentions to occur even in different paragraphs.
>>>>> Although I'm not an expert in Natural Language Understanding, I would
>>>>> say it is quite difficult to get decent precision/recall rates for
>>>>> coreferencing using fixed rules. Maybe you can give other tools a try,
>>>>> like BART (http://www.bart-coref.org/).
>>>>>
>>>>> Cheers,
>>>>> Rafa Haro
>>>>>
>>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>>>>>
>>>>>   Hi,
>>>>>
>>>>>> One of the necessary steps for implementing the Event extraction Engine
>>>>>> feature (https://issues.apache.org/jira/browse/STANBOL-1121) is to have
>>>>>> coreference resolution in the given text. This is provided now via the
>>>>>> stanford-nlp project, but as far as I saw this module performs mostly
>>>>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama)
>>>>>> coreference resolution.
>>>>>>
>>>>>> In order to get more coreferences from the text I thought of creating
>>>>>> some logic that would detect this kind of coreference:
>>>>>> "Apple reaches new profit heights. The software company just announced
>>>>>> its 2013 earnings."
>>>>>> Here "The software company" obviously refers to "Apple".
>>>>>> So I'd like to detect coreferences to Named Entities where the noun
>>>>>> phrase names the rdf:type of the Named Entity, in this case "company",
>>>>>> and also has attributes which can be found in the dbpedia categories of
>>>>>> the named entity, in this case "software".
>>>>>>
>>>>>> The detection of coreferences such as "The software company" in the
>>>>>> text would also be done either by using the new Pos Tag Based Phrase
>>>>>> extraction Engine (noun phrases) or by using a dependency tree of the
>>>>>> sentence and picking up only subjects or objects.
>>>>>>
>>>>>> At this point I'd like to know if this kind of logic would be useful
>>>>>> as a separate Enhancement Engine (in case the precision and recall are
>>>>>> good enough) in Stanbol?
>>>>>>
>>>>>> Thanks,
>>>>>> Cristian
>>>>>>
>>>>>>
>>>>>>
>>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
