Re: Named entity coref resolution based on dbpedia categories and rdf:type

Rupert Westenthaler Mon, 10 Mar 2014 04:30:13 -0700

Hello Cristian,

NounPhrases are not added to the RDF enhancement results. You need to
use the AnalysedText ContentPart [1].

Here is some demo code you can use in the computeEnhancements method:

        // the AnalysedText ContentPart holds the NLP annotations
        AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
        Iterator<? extends Section> sections = at.getSentences();
        if(!sections.hasNext()){ //process as single sentence
            sections = Collections.singleton(at).iterator();
        }

        while(sections.hasNext()){
            Section section = sections.next();
            // iterate over the chunks (phrases) enclosed by this section
            Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
            while(chunks.hasNext()){
                Span chunk = chunks.next();
                Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
                // skip chunks without a phrase annotation
                if(phrase != null && phrase.value().getCategory() == LexicalCategory.Noun){
                    log.info(" - NounPhrase [{},{}] {}", new Object[]{
                            chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
                }
            }
        }
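
If you then want to narrow these noun phrases down to the candidates from
step 1 of your proposal (quoted below), a rough sketch could look like the
following. It is plain Java and not Stanbol API: the class NounPhraseFilter,
the parallel word/tag lists and the Penn Treebank style tags ("DT", "NN...")
are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Stanbol.
public class NounPhraseFilter {

    // Determiners that point to an entity already mentioned in the text.
    private static final Set<String> LOCAL_DETERMINERS =
            new HashSet<>(Arrays.asList("the", "this", "these"));

    /**
     * words and posTags are parallel lists describing one noun phrase.
     * Assumes Penn Treebank style tags: "DT" for determiners and tags
     * starting with "NN" for nouns.
     */
    public static boolean isCandidate(List<String> words, List<String> posTags) {
        if (words.isEmpty() || words.size() != posTags.size()) {
            return false;
        }
        // (a) the phrase must start with a "local" determiner
        boolean localDeterminer = "DT".equals(posTags.get(0))
                && LOCAL_DETERMINERS.contains(words.get(0).toLowerCase(Locale.ROOT));
        // (b) the phrase must contain at least two nouns
        int nouns = 0;
        for (String tag : posTags) {
            if (tag.startsWith("NN")) {
                nouns++;
            }
        }
        return localDeterminer && nouns >= 2;
    }
}
```

In the engine you would fill the two lists from the tokens enclosed by each
chunk; how the POS tags are mapped depends on the tagger you use.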

hope this helps

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
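
PS: for steps 2 and 4 of your proposal (quoted below), here is a minimal
sketch of how the category matching could work. It is plain Java with no
Stanbol dependencies. The class CategoryMatcher, the stop word list and the
crude plural stripping are placeholders I made up; a real engine would use a
POS tagger and lemmatizer instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Stanbol.
public class CategoryMatcher {

    // Stand-in for POS tagging: words that are clearly not nouns.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "the", "in"));

    // Crude stand-in for a lemmatizer: lower-case and strip plural endings.
    public static String normalize(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Splits a category label such as "Software_companies_of_the_United_States".
    public static Set<String> terms(String label) {
        Set<String> result = new HashSet<>();
        for (String token : label.split("[\\s_,()]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                result.add(normalize(token));
            }
        }
        return result;
    }

    // Step 2: fraction of the phrase nouns found in the best matching category.
    public static double confidence(List<String> phraseNouns, List<String> categories) {
        if (phraseNouns.isEmpty()) {
            return 0.0;
        }
        double best = 0.0;
        for (String category : categories) {
            Set<String> categoryTerms = terms(category);
            int hits = 0;
            for (String noun : phraseNouns) {
                if (categoryTerms.contains(normalize(noun))) {
                    hits++;
                }
            }
            best = Math.max(best, (double) hits / phraseNouns.size());
        }
        return best;
    }

    // Step 4: index of the entity ending closest before the phrase, or -1.
    public static int closestPrior(int phraseStart, int[] entityEnds) {
        int best = -1;
        for (int i = 0; i < entityEnds.length; i++) {
            if (entityEnds[i] <= phraseStart
                    && (best < 0 || entityEnds[i] > entityEnds[best])) {
                best = i;
            }
        }
        return best;
    }
}
```

With this scoring "software company" fully matches the Microsoft category
"Software_companies_of_the_United_States", while a phrase with only one
matching noun gets a lower value. Where to put the threshold is something
you would need to evaluate.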

On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
<[email protected]> wrote:
> I started to implement the engine and I'm having problems with getting
> results for noun phrases. I modified the "default" weighted chain to also
> include the PosChunkerEngine and ran a sample text : "Angela Merkel visted
> China. The german chancellor met with various people". I expected that the
> RDF XML output would contain some info about the noun phrases but I cannot
> see any.
> Could you point me to the correct way to generate the noun phrases?
>
> Thanks,
> Cristian
>
>
> 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <[email protected]>:
>
>> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>>
>>
>> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <[email protected]>:
>>
>>> Hi Rupert,
>>>
>>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>>
>>> I will create a Jira with what we talked about here. It will probably
>>> have just a draft-like description for now and will be updated as I go
>>> along.
>>>
>>> Thanks,
>>> Cristian
>>>
>>>
>>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <[email protected]>:
>>>
>>>> Hi Cristian,
>>>>
>>>> definitely an interesting approach. You should have a look at Yago2
>>>> [1]. As far as I can remember the Yago taxonomy is much better
>>>> structured than the one used by dbpedia. Mapping dbpedia suggestions
>>>> to concepts in Yago2 is easy as both dbpedia and yago2 provide
>>>> mappings [2] and [3].
>>>>
>>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
>>>> >>
>>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>>>> >> huge profit".
>>>>
>>>> That's actually a very good example. Spatial contexts are very
>>>> important as they are often used for referencing. So I would
>>>> suggest treating the spatial context specially. For spatial Entities
>>>> (like a City) this is easy, but even for others (like a Person or
>>>> Company) you could use relations to spatial entities to define their
>>>> spatial context. This context could then be used to correctly link
>>>> "The Redmond's company" to "Microsoft".
>>>>
>>>> In addition I would suggest to use the "spatial" context of each
>>>> entity (basically relation to entities that are cities, regions,
>>>> countries) as a separate dimension, because those are very often used
>>>> for coreferences.
>>>>
>>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>>> [3]
>>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>>
>>>>
>>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>>>> <[email protected]> wrote:
>>>> > There are several dbpedia categories for each entity, in this case for
>>>> > Microsoft we have :
>>>> >
>>>> > category:Companies_in_the_NASDAQ-100_Index
>>>> > category:Microsoft
>>>> > category:Software_companies_of_the_United_States
>>>> > category:Software_companies_based_in_Washington_(state)
>>>> > category:Companies_established_in_1975
>>>> > category:1975_establishments_in_the_United_States
>>>> > category:Companies_based_in_Redmond,_Washington
>>>> > category:Multinational_companies_headquartered_in_the_United_States
>>>> > category:Cloud_computing_providers
>>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>>> >
>>>> > So we also have "Companies based in Redmond, Washington" which could be
>>>> > matched.
>>>> >
>>>> >
>>>> > There is still other contextual information from dbpedia which can be
>>>> > used.
>>>> > For example for an Organization we could also include :
>>>> > dbpprop:industry = Software
>>>> > dbpprop:service = Online Service Providers
>>>> >
>>>> > and for a Person (that's for Barack Obama) :
>>>> >
>>>> > dbpedia-owl:profession:
>>>> >                                dbpedia:Author
>>>> >                                dbpedia:Constitutional_law
>>>> >                                dbpedia:Lawyer
>>>> >                                dbpedia:Community_organizing
>>>> >
>>>> > I'd like to continue investigating this as I think that it may have
>>>> > some value in increasing the number of coreference resolutions. I'd
>>>> > like to concentrate more on precision than on recall, since we already
>>>> > have a set of coreferences detected by the stanford nlp tool and this
>>>> > would be an addition to that (at least this is how I would like to
>>>> > use it).
>>>> >
>>>> > Is it ok if I track this by opening a jira? I could update it to show
>>>> > my progress and also my conclusions, and if it turns out that it was a
>>>> > bad idea, at least I'll end up with more knowledge about Stanbol in
>>>> > the end :).
>>>> >
>>>> >
>>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <[email protected]>:
>>>> >
>>>> >> Hi Cristian,
>>>> >>
>>>> >> The approach sounds nice. I don't want to be the devil's advocate but
>>>> >> I'm just not sure about the recall using the dbpedia categories
>>>> >> feature. For example, your sentence could also be "Microsoft posted
>>>> >> its 2013 earnings. The Redmond's company made a huge profit". So,
>>>> >> maybe including more contextual information from dbpedia could
>>>> >> increase the recall, but of course it will reduce the precision.
>>>> >>
>>>> >> Cheers,
>>>> >> Rafa
>>>> >>
>>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
>>>> >>
>>>> >>> Back with a more detailed description of the steps for making this
>>>> >>> kind of coreference work.
>>>> >>>
>>>> >>> I will be referring to the following text in the steps below in
>>>> >>> order to make things clearer: "Microsoft posted its 2013 earnings.
>>>> >>> The software company made a huge profit."
>>>> >>>
>>>> >>> 1. For every noun phrase in the text which has:
>>>> >>>      a. a determiner POS which implies a reference to an entity local
>>>> >>> to the text (such as "the", "this", "these") but not one such as
>>>> >>> "another" or "every", which implies a reference to an entity outside
>>>> >>> of the text.
>>>> >>>      b. at least one other noun aside from the main required noun
>>>> >>> which further describes it. For example I will not count "The
>>>> >>> company" as a legitimate candidate since this could create a lot of
>>>> >>> false positives given the double meaning of some words, as in "in
>>>> >>> the company of good people".
>>>> >>> "The software company" is a good candidate since we also have
>>>> >>> "software".
>>>> >>>
>>>> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
>>>> >>> categories of each named entity found prior to the location of the
>>>> >>> noun phrase in the text.
>>>> >>> The dbpedia categories are in the following format (for Microsoft,
>>>> >>> for example): "Software companies of the United States".
>>>> >>> So we try to match "software company" with that.
>>>> >>> First, as you can see, the main noun in the dbpedia category is in
>>>> >>> plural form, and it's the same for all categories which I saw. I
>>>> >>> don't know if there's an easier way to do this but I thought of
>>>> >>> applying a lemmatizer on the category and the noun phrase in order
>>>> >>> for them to have a common denominator. This also works if the noun
>>>> >>> phrase itself is in plural form.
>>>> >>>
>>>> >>> Second, I'll need to use for comparison only the words in the
>>>> >>> category which are themselves nouns and not prepositions or
>>>> >>> determiners such as "of the". This means that I need to pos tag the
>>>> >>> category contents as well.
>>>> >>> I was thinking of running the pos tagger and lemmatizer on the
>>>> >>> dbpedia categories when building the dbpedia backed entity hub and
>>>> >>> storing the results for later use - I don't know how feasible this
>>>> >>> is at the moment.
>>>> >>>
>>>> >>> After this I can compare each noun in the noun phrase with the
>>>> >>> equivalent nouns in the categories and based on the number of
>>>> >>> matches I can create a confidence level.
>>>> >>>
>>>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia
>>>> >>> of the named entity. If this matches, increase the confidence level.
>>>> >>>
>>>> >>> 4. If there are multiple named entities which can match a certain
>>>> >>> noun phrase then link the noun phrase with the closest named entity
>>>> >>> prior to it in the text.
>>>> >>>
>>>> >>> What do you think?
>>>> >>>
>>>> >>> Cristian
>>>> >>>
>>>> >>> 2014-01-31 Cristian Petroaca <[email protected]>:
>>>> >>>
>>>> >>>> Hi Rafa,
>>>> >>>>
>>>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll
>>>> >>>> provide it here so that you guys can give me feedback on it.
>>>> >>>>
>>>> >>>> What are "locality" features?
>>>> >>>>
>>>> >>>> I looked at Bart and other coref tools such as ArkRef and
>>>> >>>> CherryPicker and they don't provide such a coreference.
>>>> >>>>
>>>> >>>> Cristian
>>>> >>>>
>>>> >>>>
>>>> >>>> 2014-01-30 Rafa Haro <[email protected]>:
>>>> >>>>
>>>> >>>>> Hi Cristian,
>>>> >>>>>
>>>> >>>>> Without having more details about your concrete heuristic, in my
>>>> >>>>> honest opinion, such an approach could produce a lot of false
>>>> >>>>> positives. I don't know if you are planning to use some "locality"
>>>> >>>>> features to detect such coreferences but you need to take into
>>>> >>>>> account that it is quite usual that coreferenced mentions occur
>>>> >>>>> even in different paragraphs. Although I'm not an expert in Natural
>>>> >>>>> Language Understanding, I would say it is quite difficult to get
>>>> >>>>> decent precision/recall rates for coreferencing using fixed rules.
>>>> >>>>> Maybe you can give a try to other tools like BART
>>>> >>>>> (http://www.bart-coref.org/).
>>>> >>>>>
>>>> >>>>> Cheers,
>>>> >>>>> Rafa Haro
>>>> >>>>>
>>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>>>> >>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> One of the necessary steps for implementing the Event extraction
>>>> >>>>>> Engine feature (https://issues.apache.org/jira/browse/STANBOL-1121)
>>>> >>>>>> is to have coreference resolution in the given text. This is
>>>> >>>>>> provided now via the stanford-nlp project but as far as I saw this
>>>> >>>>>> module is performing mostly pronominal (He, She) or nominal
>>>> >>>>>> (Barack Obama and Mr. Obama) coreference resolution.
>>>> >>>>>>
>>>> >>>>>> In order to get more coreferences from the text I thought of
>>>> >>>>>> creating some logic that would detect this kind of coreference:
>>>> >>>>>> "Apple reaches new profit heights. The software company just
>>>> >>>>>> announced its 2013 earnings."
>>>> >>>>>> Here "The software company" obviously refers to "Apple".
>>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of
>>>> >>>>>> the rdf:type of the Named Entity, in this case "company", and also
>>>> >>>>>> have attributes which can be found in the dbpedia categories of
>>>> >>>>>> the named entity, in this case "software".
>>>> >>>>>>
>>>> >>>>>> The detection of coreferences such as "The software company" in
>>>> >>>>>> the text would also be done by either using the new Pos Tag Based
>>>> >>>>>> Phrase extraction Engine (noun phrases) or by using a dependency
>>>> >>>>>> tree of the sentence and picking up only subjects or objects.
>>>> >>>>>>
>>>> >>>>>> At this point I'd like to know if this kind of logic would be
>>>> >>>>>> useful as a separate Enhancement Engine (in case the precision and
>>>> >>>>>> recall are good enough) in Stanbol?
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Cristian
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> | Rupert Westenthaler             [email protected]
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>>
>>>
>>>
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
