Re: Named entity coref resolution based on dbpedia categories and rdf:type

Cristian Petroaca Fri, 07 Feb 2014 00:54:59 -0800

Hi Rupert,

The "spatial" dimension is a good idea. I'll also take a look at Yago.


I will create a Jira with what we talked about here. It will probably have
just a draft-like description for now and will be updated as I go along.

Thanks,
Cristian


2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <[email protected]>:

> Hi Cristian,
>
> definitely an interesting approach. You should have a look at Yago2
> [1]. As far as I can remember the Yago taxonomy is much better
> structured than the one used by dbpedia. Mapping suggestions from
> dbpedia to concepts in Yago2 is easy, as both dbpedia and Yago2
> provide mappings ([2] and [3]).
>
> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
> >>
> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
> >> huge profit".
>
> That's actually a very good example. Spatial contexts are very
> important as they are often used for referencing. So I would
> suggest treating the spatial context specially. For spatial entities
> (like a City) this is easy, but even for others (like a Person or
> Company) you could use relations to spatial entities to define their
> spatial context. This context could then be used to correctly link
> "The Redmond's company" to "Microsoft".
>
> In addition I would suggest using the "spatial" context of each
> entity (basically its relations to entities that are cities, regions,
> or countries) as a separate dimension, because those are very often
> used for coreferences.
>
> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>
>
> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> <[email protected]> wrote:
> > There are several dbpedia categories for each entity, in this case for
> > Microsoft we have :
> >
> > category:Companies_in_the_NASDAQ-100_Index
> > category:Microsoft
> > category:Software_companies_of_the_United_States
> > category:Software_companies_based_in_Washington_(state)
> > category:Companies_established_in_1975
> > category:1975_establishments_in_the_United_States
> > category:Companies_based_in_Redmond,_Washington
> > category:Multinational_companies_headquartered_in_the_United_States
> > category:Cloud_computing_providers
> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >
> > So we also have "Companies based in Redmond, Washington" which could be
> > matched.
> >
> >
> > There is still other contextual information from dbpedia which can be
> > used. For example for an Organization we could also include :
> > dbpprop:industry = Software
> > dbpprop:service = Online Service Providers
> >
> > and for a Person (that's for Barack Obama) :
> >
> > dbpedia-owl:profession:
> >                                dbpedia:Author
> >                                dbpedia:Constitutional_law
> >                                dbpedia:Lawyer
> >                                dbpedia:Community_organizing
> >
> > I'd like to continue investigating this as I think that it may have some
> > value in increasing the number of coreference resolutions. I'd like to
> > concentrate more on precision than on recall, since we already have a
> > set of coreferences detected by the Stanford NLP tool and this would be
> > an addition to that (at least this is how I would like to use it).
> >
> > Is it ok if I track this by opening a Jira? I could update it to show my
> > progress and also my conclusions, and if it turns out that it was a bad
> > idea, then at least I'll end up with more knowledge about Stanbol :).
> >
> >
> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
> >
> >> Hi Cristian,
> >>
> >> The approach sounds nice. I don't want to be the devil's advocate but
> >> I'm just not sure about the recall using the dbpedia categories feature.
> >> For example, your sentence could also be "Microsoft posted its 2013
> >> earnings. The Redmond's company made a huge profit". So, maybe including
> >> more contextual information from dbpedia could increase the recall but
> >> of course will reduce the precision.
> >>
> >> Cheers,
> >> Rafa
> >>
> >> On 04/02/14 09:50, Cristian Petroaca wrote:
> >>
> >>> Back with a more detailed description of the steps for making this
> >>> kind of coreference work.
> >>>
> >>> I will be using references to the following text in the steps below in
> >>> order to make things clearer : "Microsoft posted its 2013 earnings. The
> >>> software company made a huge profit."
> >>>
> >>> 1. For every noun phrase in the text which has :
> >>>      a. a determiner POS which implies a reference to an entity local
> >>> to the text (such as "the", "this", "these"), but not one such as
> >>> "another" or "every", which implies a reference to an entity outside
> >>> of the text.
> >>>      b. at least one other noun aside from the main required noun
> >>> which further describes it. For example I will not count "The company"
> >>> as a legitimate candidate since this could create a lot of false
> >>> positives because of the double meaning of some words, as in "in the
> >>> company of good people".
> >>> "The software company" is a good candidate since we also have
> >>> "software".
> >>>
> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
> >>> categories of each named entity found prior to the location of the noun
> >>> phrase in the text.
> >>> The dbpedia categories are in the following format (for Microsoft for
> >>> example) : "Software companies of the United States".
> >>> So we try to match "software company" with that.
> >>> First, as you can see, the main noun in the dbpedia category has a
> >>> plural form, and it's the same for all categories which I saw. I don't
> >>> know if there's an easier way to do this but I thought of applying a
> >>> lemmatizer on the category and the noun phrase in order for them to
> >>> have a common denominator. This also works if the noun phrase itself
> >>> has a plural form.
> >>>
> >>> Second, I'll need to use for comparison only the words in the category
> >>> which are themselves nouns and not prepositions or determiners such as
> >>> "of the". This means that I need to POS tag the categories' contents as
> >>> well. I was thinking of running the POS tagger and lemmatizer on the
> >>> dbpedia categories when building the dbpedia backed entity hub and
> >>> storing the results for later use - I don't know how feasible this is
> >>> at the moment.
> >>>
> >>> After this I can compare each noun in the noun phrase with the
> >>> equivalent nouns in the categories and based on the number of matches
> >>> I can create a confidence level.
> >>>
> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia of
> >>> the named entity. If this matches, increase the confidence level.
> >>>
> >>> 4. If there are multiple named entities which can match a certain noun
> >>> phrase then link the noun phrase with the closest named entity prior
> >>> to it in the text.
> >>>
> >>> What do you think?
> >>>
> >>> Cristian
> >>>
> >>> 2014-01-31 Cristian Petroaca <[email protected]>:
> >>>
> >>>> Hi Rafa,
> >>>>
> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
> >>>> provide it here so that you guys can give me feedback on it.
> >>>>
> >>>> What are "locality" features?
> >>>>
> >>>> I looked at BART and other coref tools such as ArkRef and CherryPicker
> >>>> and they don't provide such a coreference.
> >>>>
> >>>> Cristian
> >>>>
> >>>>
> >>>> 2014-01-30 Rafa Haro <[email protected]>:
> >>>>
> >>>>> Hi Cristian,
> >>>>>
> >>>>> Without having more details about your concrete heuristic, in my
> >>>>> honest opinion, such an approach could produce a lot of false
> >>>>> positives. I don't know if you are planning to use some "locality"
> >>>>> features to detect such coreferences but you need to take into
> >>>>> account that it is quite usual for coreferenced mentions to occur
> >>>>> even in different paragraphs. Although I'm not an expert in Natural
> >>>>> Language Understanding, I would say it is quite difficult to get
> >>>>> decent precision/recall rates for coreferencing using fixed rules.
> >>>>> Maybe you can give a try to other tools like BART
> >>>>> (http://www.bart-coref.org/).
> >>>>>
> >>>>> Cheers,
> >>>>> Rafa Haro
> >>>>>
> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> One of the necessary steps for implementing the Event extraction
> >>>>>> Engine feature : https://issues.apache.org/jira/browse/STANBOL-1121
> >>>>>> is to have coreference resolution in the given text. This is provided
> >>>>>> now via the stanford-nlp project but as far as I saw this module is
> >>>>>> performing mostly pronominal (He, She) or nominal (Barack Obama and
> >>>>>> Mr. Obama) coreference resolution.
> >>>>>>
> >>>>>> In order to get more coreferences from the text I thought of creating
> >>>>>> some logic that would detect this kind of coreference :
> >>>>>> "Apple reaches new profit heights. The software company just
> >>>>>> announced its 2013 earnings."
> >>>>>> Here "The software company" obviously refers to "Apple".
> >>>>>> So I'd like to detect coreferences to Named Entities where the noun
> >>>>>> phrase matches the rdf:type of the Named Entity, in this case
> >>>>>> "company", and also has attributes which can be found in the dbpedia
> >>>>>> categories of the named entity, in this case "software".
> >>>>>>
> >>>>>> The detection of coreferences such as "The software company" in the
> >>>>>> text would also be done by either using the new Pos Tag Based Phrase
> >>>>>> extraction Engine (noun phrases) or by using a dependency tree of the
> >>>>>> sentence and picking up only subjects or objects.
> >>>>>>
> >>>>>> At this point I'd like to know if this kind of logic would be useful
> >>>>>> as a separate Enhancement Engine (in case the precision and recall
> >>>>>> are good enough) in Stanbol?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Cristian
> >>>>>>
> >>>>>>
> >>>>>>
> >>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
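
The four steps Cristian describes in the thread (candidate noun phrases, matching against dbpedia categories, the rdf:type bonus, and picking the closest prior entity) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Stanbol code: the stop list and the crude plural stripping stand in for a real POS tagger and lemmatizer, and the function names, the 0.5 type bonus, the 1.0 threshold and the entity dictionary layout are all hypothetical.

```python
import re

# Assumption: a tiny stop list stands in for POS tagging (step 2 keeps only nouns).
STOP_WORDS = {"of", "the", "in", "based", "a", "an", "and"}
# Determiners that signal a reference to an entity local to the text (step 1a).
LOCAL_DETERMINERS = {"the", "this", "these"}


def lemma(word):
    """Crude plural stripping; a stand-in for a real lemmatizer."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def content_lemmas(text):
    """Lemmas of the non-stop-word tokens, with possessive 's removed."""
    tokens = re.findall(r"[A-Za-z]+", text.replace("'s", ""))
    return {lemma(t) for t in tokens if t.lower() not in STOP_WORDS}


def is_candidate(noun_phrase):
    """Step 1: a local determiner followed by at least two content words."""
    tokens = noun_phrase.split()
    if not tokens or tokens[0].lower() not in LOCAL_DETERMINERS:
        return False
    return len(content_lemmas(" ".join(tokens[1:]))) >= 2


def confidence(noun_phrase, categories, rdf_types):
    """Steps 2 and 3: best category overlap plus a bonus for an rdf:type match."""
    np_lemmas = content_lemmas(noun_phrase)
    if not np_lemmas:
        return 0.0
    best = max(
        (len(np_lemmas & content_lemmas(c)) / len(np_lemmas) for c in categories),
        default=0.0,
    )
    type_lemmas = {lemma(t) for t in rdf_types}
    bonus = 0.5 if np_lemmas & type_lemmas else 0.0  # hypothetical weight
    return best + bonus


def resolve(noun_phrase, np_offset, entities, threshold=1.0):
    """Step 4: among matching entities before the phrase, pick the closest one."""
    if not is_candidate(noun_phrase):
        return None
    matches = [
        e for e in entities
        if e["offset"] < np_offset
        and confidence(noun_phrase, e["categories"], e["types"]) >= threshold
    ]
    return max(matches, key=lambda e: e["offset"])["name"] if matches else None


# Example data taken from the category list quoted in the thread.
microsoft = {
    "name": "Microsoft",
    "offset": 0,
    "types": ["Company"],
    "categories": [
        "Software companies of the United States",
        "Companies based in Redmond, Washington",
    ],
}
print(resolve("The software company", 36, [microsoft]))
print(resolve("The company", 36, [microsoft]))
```

With these assumptions "The software company" is linked to Microsoft, while "The company" is rejected in step 1 because it has no second noun. "The Redmond's company" also matches, through the "Companies based in Redmond, Washington" category.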

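Rupert's suggestion to treat the spatial context as a separate dimension could look something like the sketch below. It assumes each named entity already carries a set of related place names (how those are derived from dbpedia or Yago relations is not covered here). The `spatial` field, the function names and the example data are hypothetical and only meant to illustrate the idea.

```python
import re


def spatial_matches(noun_phrase, spatial_context):
    """Return the places from an entity's spatial context that the phrase mentions."""
    phrase = noun_phrase.lower()
    return {
        place for place in spatial_context
        if re.search(r"\b" + re.escape(place.lower()) + r"\b", phrase)
    }


def link_by_spatial_context(noun_phrase, entities):
    """Pick the entity sharing the most places with the phrase; None if no overlap."""
    scored = [
        (len(spatial_matches(noun_phrase, e["spatial"])), e["name"])
        for e in entities
    ]
    best_score, best_name = max(scored, default=(0, None))
    return best_name if best_score > 0 else None


# Illustrative data: places related to each entity (cities, regions, countries).
entities = [
    {"name": "Microsoft", "spatial": {"Redmond", "Washington", "United States"}},
    {"name": "Apple", "spatial": {"Cupertino", "California", "United States"}},
]
print(link_by_spatial_context("The Redmond's company", entities))
print(link_by_spatial_context("The software company", entities))
```

Keeping this score apart from the category-based confidence means a mention such as "The Redmond's company" can be linked even when no category noun matches, and the two dimensions can be weighted independently later.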