Re: Named entity coref resolution based on dbpedia categories and rdf:type

Cristian Petroaca Sun, 09 Feb 2014 04:17:08 -0800

Opened https://issues.apache.org/jira/browse/STANBOL-1279



2014-02-07 10:53 GMT+02:00 Cristian Petroaca <[email protected]>:

> Hi Rupert,
>
> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>
> I will create a Jira with what we talked about here. It will probably have
> just a draft-like description for now and will be updated as I go along.
>
> Thanks,
> Cristian
>
>
> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> [email protected]>:
>
> Hi Cristian,
>>
>> definitely an interesting approach. You should have a look at Yago2
>> [1]. As far as I can remember, the Yago taxonomy is much better
>> structured than the one used by dbpedia. Mapping dbpedia suggestions
>> to concepts in Yago2 is easy, as both dbpedia and Yago2 provide
>> mappings ([2] and [3]).
>>
>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
>> >>
>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>> >> huge profit".
>>
>> That's actually a very good example. Spatial contexts are very
>> important, as they tend to be used often for referencing. So I would
>> suggest treating the spatial context specially. For spatial Entities
>> (like a City) this is easy, but even for others (like a Person or
>> Company) you could use relations to spatial entities to define their
>> spatial context. This context could then be used to correctly link
>> "The Redmond's company" to "Microsoft".
>>
>> In addition, I would suggest using the "spatial" context of each
>> entity (basically its relations to entities that are cities, regions,
>> countries) as a separate dimension, because those are very often used
>> for coreferences.
>>
>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> [3]
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>
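A minimal sketch of the spatial-context idea from the message above, assuming entity relations and types have already been fetched. The data, the dictionaries `ENTITY_TYPES` and `RELATED`, and all function names are hypothetical illustrations, not DBpedia, Yago or Stanbol APIs. The spatial score is kept as its own dimension so it can be weighed separately from other evidence.

```python
SPATIAL_TYPES = {"City", "Region", "Country"}

# Toy data for illustration only; not actual DBpedia or Yago output.
ENTITY_TYPES = {
    "Redmond": {"City"},
    "Cupertino": {"City"},
    "United_States": {"Country"},
    "Software": {"Industry"},
}
RELATED = {
    "Microsoft": {"Redmond", "United_States", "Software"},
    "Apple": {"Cupertino", "United_States"},
}


def spatial_context(entity):
    """Labels of related entities that are cities, regions or countries."""
    return {
        target.replace("_", " ").lower()
        for target in RELATED.get(entity, set())
        if ENTITY_TYPES.get(target, set()) & SPATIAL_TYPES
    }


def strip_possessive(word):
    """Turn "Redmond's" into "Redmond"; leave other words unchanged."""
    return word[:-2] if word.endswith("'s") else word


def spatial_score(modifiers, entity):
    """Fraction of the phrase's modifiers found in the entity's spatial context."""
    mods = {strip_possessive(m).lower() for m in modifiers}
    if not mods:
        return 0.0
    return len(mods & spatial_context(entity)) / len(mods)


def best_spatial_match(modifiers, entities):
    """Entity with the highest spatial score, or None if nothing matches."""
    best = max(entities, key=lambda e: spatial_score(modifiers, e), default=None)
    if best is None or spatial_score(modifiers, best) == 0.0:
        return None
    return best


# "The Redmond's company": the modifier is "Redmond's", the head noun "company".
print(best_spatial_match(["Redmond's"], ["Apple", "Microsoft"]))
```

In this toy setup the modifier "Redmond's" only overlaps with the spatial context of Microsoft, so the phrase is linked to it even though no category or type information is consulted.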
>>
>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> <[email protected]> wrote:
>> > There are several dbpedia categories for each entity, in this case for
>> > Microsoft we have :
>> >
>> > category:Companies_in_the_NASDAQ-100_Index
>> > category:Microsoft
>> > category:Software_companies_of_the_United_States
>> > category:Software_companies_based_in_Washington_(state)
>> > category:Companies_established_in_1975
>> > category:1975_establishments_in_the_United_States
>> > category:Companies_based_in_Redmond,_Washington
>> > category:Multinational_companies_headquartered_in_the_United_States
>> > category:Cloud_computing_providers
>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >
>> > So we also have "Companies based in Redmond, Washington" which could be
>> > matched.
>> >
>> >
>> > There is other contextual information in dbpedia which can also be
>> > used. For example, for an Organization we could also include:
>> > dbpprop:industry = Software
>> > dbpprop:service = Online Service Providers
>> >
>> > and for a Person (that's for Barack Obama) :
>> >
>> > dbpedia-owl:profession:
>> >                                dbpedia:Author
>> >                                dbpedia:Constitutional_law
>> >                                dbpedia:Lawyer
>> >                                dbpedia:Community_organizing
>> >
>> > I'd like to continue investigating this, as I think it may have some
>> > value in increasing the number of coreference resolutions. I'd like to
>> > concentrate on precision rather than recall, since we already have a
>> > set of coreferences detected by the stanford nlp tool and this would be
>> > an addition to that (at least this is how I would like to use it).
>> >
>> > Is it ok if I track this by opening a jira? I could update it to show my
>> > progress and my conclusions. If it turns out that it was a bad idea,
>> > at least I'll end up with more knowledge about Stanbol in the end :).
>> >
>> >
>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
>> >
>> >> Hi Cristian,
>> >>
>> >> The approach sounds nice. I don't want to be the devil's advocate, but
>> >> I'm just not sure about the recall using the dbpedia categories
>> >> feature. For example, your sentence could also be "Microsoft posted
>> >> its 2013 earnings. The Redmond's company made a huge profit". So
>> >> including more contextual information from dbpedia could increase the
>> >> recall, but of course it will reduce the precision.
>> >>
>> >> Cheers,
>> >> Rafa
>> >>
>> >> El 04/02/14 09:50, Cristian Petroaca escribió:
>> >>
>> >>> Back with a more detailed description of the steps for making this
>> >>> kind of coreference work.
>> >>>
>> >>> I will be using references to the following text in the steps below
>> >>> in order to make things clearer: "Microsoft posted its 2013
>> >>> earnings. The software company made a huge profit."
>> >>>
>> >>> 1. For every noun phrase in the text which has:
>> >>>      a. a determiner POS which implies reference to an entity local
>> >>> to the text (such as "the", "this", "these") but not one such as
>> >>> "another" or "every", which implies a reference to an entity outside
>> >>> of the text.
>> >>>      b. at least one other noun aside from the main required noun
>> >>> which further describes it. For example, I will not count "The
>> >>> company" as a legitimate candidate, since this could create a lot of
>> >>> false positives given the double meaning of some words, as in "in
>> >>> the company of good people".
>> >>> "The software company" is a good candidate since we also have
>> >>> "software".
>> >>>
>> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
>> >>> categories of each named entity found prior to the location of the
>> >>> noun phrase in the text.
>> >>> The dbpedia categories are in the following format (for Microsoft,
>> >>> for example): "Software companies of the United States".
>> >>> So we try to match "software company" with that.
>> >>> First, as you can see, the main noun in the dbpedia category is in
>> >>> plural form, and it's the same for all the categories I saw. I don't
>> >>> know if there's an easier way to do this, but I thought of applying
>> >>> a lemmatizer to the category and the noun phrase in order for them
>> >>> to have a common denominator. This also works if the noun phrase
>> >>> itself is in plural form.
>> >>>
>> >>> Second, for the comparison I'll need to use only the words in the
>> >>> category which are themselves nouns, and not prepositions or
>> >>> determiners such as "of the". This means that I need to pos tag the
>> >>> contents of the categories as well.
>> >>> I was thinking of running the pos tagger and lemmatizer on the
>> >>> dbpedia categories when building the dbpedia backed entity hub and
>> >>> storing the results for later use - I don't know how feasible this
>> >>> is at the moment.
>> >>>
>> >>> After this I can compare each noun in the noun phrase with the
>> >>> equivalent nouns in the categories, and based on the number of
>> >>> matches I can create a confidence level.
>> >>>
>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia
>> >>> of the named entity. If this matches, increase the confidence level.
>> >>>
>> >>> 4. If there are multiple named entities which can match a certain
>> >>> noun phrase, then link the noun phrase with the closest named entity
>> >>> prior to it in the text.
>> >>>
>> >>> What do you think?
>> >>>
>> >>> Cristian
>> >>>
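The four steps in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions: `naive_lemma` is a toy plural stripper standing in for a real lemmatizer, the `NON_NOUNS` stopword list stands in for POS tagging of the category labels, noun phrases arrive already split into determiner and nouns, and the 0.5 type bonus and the 1.0 threshold are arbitrary. None of the names are Stanbol APIs.

```python
import re
from dataclasses import dataclass


def naive_lemma(word):
    # Assumption: toy plural stripper in place of a real lemmatizer.
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


# Assumption: a stopword list in place of POS tagging the category labels.
NON_NOUNS = {"of", "the", "in", "based", "and", "for", "by", "to"}
LOCAL_DETERMINERS = {"the", "this", "these"}


@dataclass
class Entity:
    name: str
    position: int      # offset of the mention in the text
    categories: list   # dbpedia category labels
    types: list        # rdf:type labels


@dataclass
class NounPhrase:
    determiner: str
    nouns: list        # modifiers first, head noun last
    position: int


def category_lemmas(category):
    tokens = re.findall(r"[A-Za-z0-9]+", category)
    return {naive_lemma(t) for t in tokens} - NON_NOUNS


def is_candidate(np):
    # Step 1: local determiner plus at least one noun besides the head.
    return np.determiner.lower() in LOCAL_DETERMINERS and len(np.nouns) >= 2


def confidence(np, entity):
    lemmas = [naive_lemma(n) for n in np.nouns]
    # Step 2: best fraction of phrase nouns found in a single category.
    best = max(
        (sum(lem in category_lemmas(c) for lem in lemmas) / len(lemmas)
         for c in entity.categories),
        default=0.0,
    )
    # Step 3: bonus when the head noun equals one of the rdf:type labels.
    type_match = lemmas[-1] in {naive_lemma(t) for t in entity.types}
    return best + (0.5 if type_match else 0.0)


def resolve(np, entities, threshold=1.0):
    if not is_candidate(np):
        return None
    matches = [e for e in entities
               if e.position < np.position and confidence(np, e) >= threshold]
    # Step 4: among matching entities, take the closest one before the phrase.
    return max(matches, key=lambda e: e.position, default=None)


microsoft = Entity(
    "Microsoft", 0,
    ["Software_companies_of_the_United_States",
     "Companies_based_in_Redmond,_Washington"],
    ["Company"],
)
phrase = NounPhrase("The", ["software", "company"], 36)
print(resolve(phrase, [microsoft]).name)  # Microsoft
```

With the same toy data, "The company" is rejected in step 1 because it has no noun besides the head, and "Another software company" is rejected because of its determiner.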
>> >>> 2014-01-31 Cristian Petroaca <[email protected]>:
>> >>>
>> >>>  Hi Rafa,
>> >>>>
>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
>> >>>> provide it here so that you guys can give me feedback on it.
>> >>>>
>> >>>> What are "locality" features?
>> >>>>
>> >>>> I looked at BART and other coref tools such as ArkRef and
>> >>>> CherryPicker, and they don't provide this kind of coreference.
>> >>>>
>> >>>> Cristian
>> >>>>
>> >>>>
>> >>>> 2014-01-30 Rafa Haro <[email protected]>:
>> >>>>
>> >>>> Hi Cristian,
>> >>>>
>> >>>>> Without having more details about your concrete heuristic, in my
>> >>>>> honest opinion, such an approach could produce a lot of false
>> >>>>> positives. I don't know if you are planning to use some "locality"
>> >>>>> features to detect such coreferences, but you need to take into
>> >>>>> account that it is quite usual for coreferenced mentions to occur
>> >>>>> even in different paragraphs. Although I'm not an expert in
>> >>>>> Natural Language Understanding, I would say it is quite difficult
>> >>>>> to get decent precision/recall rates for coreferencing using fixed
>> >>>>> rules. Maybe you can give other tools a try, like BART
>> >>>>> (http://www.bart-coref.org/).
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Rafa Haro
>> >>>>>
>> >>>>> El 30/01/14 10:33, Cristian Petroaca escribió:
>> >>>>>
>> >>>>>   Hi,
>> >>>>>
>> >>>>>> One of the necessary steps for implementing the Event extraction
>> >>>>>> Engine feature (https://issues.apache.org/jira/browse/STANBOL-1121)
>> >>>>>> is to have coreference resolution in the given text. This is
>> >>>>>> provided now via the stanford-nlp project, but as far as I saw this
>> >>>>>> module performs mostly pronominal (He, She) or nominal (Barack
>> >>>>>> Obama and Mr. Obama) coreference resolution.
>> >>>>>>
>> >>>>>> In order to get more coreferences from the text I thought of
>> >>>>>> creating some logic that would detect this kind of coreference:
>> >>>>>> "Apple reaches new profit heights. The software company just
>> >>>>>> announced its 2013 earnings."
>> >>>>>> Here "The software company" obviously refers to "Apple".
>> >>>>>> So I'd like to detect mentions coreferent with a Named Entity
>> >>>>>> which are of the rdf:type of that Named Entity, in this case
>> >>>>>> "company", and also have attributes which can be found in the
>> >>>>>> dbpedia categories of the named entity, in this case "software".
>> >>>>>>
>> >>>>>> The detection of candidate phrases such as "The software company"
>> >>>>>> in the text would be done either by using the new Pos Tag Based
>> >>>>>> Phrase extraction Engine (noun phrases) or by using a dependency
>> >>>>>> tree of the sentence and picking up only subjects or objects.
>> >>>>>>
>> >>>>>> At this point I'd like to know if this kind of logic would be
>> >>>>>> useful as a separate Enhancement Engine in Stanbol (in case the
>> >>>>>> precision and recall are good enough)?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Cristian
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
>
