Re: Named entity coref resolution based on dbpedia categories and rdf:type

Cristian Petroaca Thu, 20 Mar 2014 02:01:21 -0700

stanbol.enhancer.chain.weighted.chain=["tika;optional","langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-ner","dbpediaLinking","entityhubExtraction","dbpedia-dereference","pos-chunker"]
service.ranking=I"-2147483648"
stanbol.enhancer.chain.name="default"




2014-03-20 7:39 GMT+02:00 Rupert Westenthaler <[email protected]>:

> Hi Cristian,
>
> you cannot send attachments to the list. Please copy the contents
> directly into the mail.
>
> thx
> Rupert
>
> On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
> <[email protected]> wrote:
> > The config attached.
> >
> >
> > 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
> > <[email protected]>:
> >
> >> Hi Cristian,
> >>
> >> can you provide the contents of the chain after your modifications?
> >> Would be interesting to test why the chain is no longer active after
> >> the restart.
> >>
> >> You can find the config file in the 'stanbol/fileinstall' folder.
> >>
> >> best
> >> Rupert
> >>
> >> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
> >> <[email protected]> wrote:
> >> > Related to the default chain selection rules: before the restart I had
> >> > a chain named 'default', which I could access via
> >> > enhancer/chain/default. Then I just added another engine to the
> >> > 'default' chain. I assumed that after the restart the chain with the
> >> > 'default' name would be persisted, so the first rule should have been
> >> > applied after the restart as well. But instead I cannot reach it via
> >> > enhancer/chain/default anymore, so it's gone.
> >> > Anyway, this is not a big deal and it's not blocking me in any way; I
> >> > just wanted to understand where the problem is.
> >> >
> >> >
> >> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler
> >> > <[email protected]
> >> >>:
> >> >
> >> >> Hi Cristian
> >> >>
> >> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
> >> >> <[email protected]> wrote:
> >> >> > 1. Updated to the latest code and it's gone. Cool
> >> >> >
> >> >> > 2. I start the stable launcher -> create a new instance of the
> >> >> > PosChunkerEngine -> add it to the default chain. At this point
> >> >> > everything looks good and works ok.
> >> >> > After I restart the server the default chain is gone and instead I
> >> >> > see this in the enhancement chains page: all-active (default, id: 149,
> >> >> > ranking: 0, impl: AllActiveEnginesChain). all-active did not contain
> >> >> > the 'default' word before the restart.
> >> >> >
> >> >>
> >> >> Please note the default chain selection rules as described at [1]. You
> >> >> can also access chains under '/enhancer/chain/{chain-name}'.
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> [1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
> >> >>
> >> >> > It looks like the config files are exactly what I need. Thanks.
> >> >> >
> >> >> >
> >> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <
> >> >> [email protected]
> >> >> >>:
> >> >> >
> >> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
> >> >> >> <[email protected]> wrote:
> >> >> >> > Thanks Rupert.
> >> >> >> >
> >> >> >> > A couple more questions/issues :
> >> >> >> >
> >> >> >> > 1. Whenever I start the stanbol server I'm seeing this in the
> >> >> >> > console
> >> >> >> > output :
> >> >> >> >
> >> >> >>
> >> >> >> This should be fixed with STANBOL-1278 [1] [2]
> >> >> >>
> >> >> >> > 2. Whenever I restart the server the Weighted Chains get messed
> >> >> >> > up. I usually use the 'default' chain and add my engine to it so
> >> >> >> > there are 11 engines in it. After the restart this chain now
> >> >> >> > contains around 23 engines in total.
> >> >> >>
> >> >> >> I was not able to replicate this. What I tried was
> >> >> >>
> >> >> >> (1) start up the stable launcher
> >> >> >> (2) add an additional engine to the default chain
> >> >> >> (3) restart the launcher
> >> >> >>
> >> >> >> The default chain was not changed after (2) and (3). So I would need
> >> >> >> further information to know why this is happening.
> >> >> >>
> >> >> >> Generally it is better to create your own chain instance rather than
> >> >> >> modifying one that is provided by the default configuration. I would
> >> >> >> also recommend that you keep your test configuration in text files
> >> >> >> and copy those to the 'stanbol/fileinstall' folder. Doing so saves
> >> >> >> you from manually re-entering the configuration after a software
> >> >> >> update. The production-mode section [3] provides information on how
> >> >> >> to do that.
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
> >> >> >> [2] http://svn.apache.org/r1576623
> >> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
> >> >> >>
> >> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error starting
> >> >> >> > slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
> >> >> >> > (org.osgi.framework.BundleException: Unresolved constraint in bundle
> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
> >> >> >> > missing requirement [153.0] package;
> >> >> >> > (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
> >> >> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
> >> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0:
> >> >> >> > missing requirement [153.0] package;
> >> >> >> > (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
> >> >> >> >         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
> >> >> >> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
> >> >> >> >         at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
> >> >> >> >         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
> >> >> >> >         at java.lang.Thread.run(Unknown Source)
> >> >> >> >
> >> >> >> > Despite this, the server starts fine and I can use the enhancer
> >> >> >> > fine. Do you guys see this as well?
> >> >> >> >
> >> >> >> >
> >> >> >> > 2. Whenever I restart the server the Weighted Chains get messed
> >> >> >> > up. I usually use the 'default' chain and add my engine to it so
> >> >> >> > there are 11 engines in it. After the restart this chain now
> >> >> >> > contains around 23 engines in total.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <
> >> >> >> [email protected]
> >> >> >> >>:
> >> >> >> >
> >> >> >> >> Hi Cristian,
> >> >> >> >>
> >> >> >> >> NER Annotations are typically available as both
> >> >> >> >> NlpAnnotations.NER_ANNOTATION and fise:TextAnnotation [1] in the
> >> >> >> >> enhancement metadata. As you are already accessing the
> >> >> >> >> AnalyzedText I would prefer using NlpAnnotations.NER_ANNOTATION.
> >> >> >> >>
> >> >> >> >> best
> >> >> >> >> Rupert
> >> >> >> >>
> >> >> >> >> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> >> >> >> >>
> >> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
> >> >> >> >> <[email protected]> wrote:
> >> >> >> >> > Thanks.
> >> >> >> >> > I assume I should get the Named entities using the same but
> >> >> >> >> > with
> >> >> >> >> > NlpAnnotations.NER_ANNOTATION?
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> > [email protected]>:
> >> >> >> >> >
> >> >> >> >> >> Hallo Cristian,
> >> >> >> >> >>
> >> >> >> >> >> NounPhrases are not added to the RDF enhancement results. You
> >> >> >> >> >> need to use the AnalyzedText ContentPart [1].
> >> >> >> >> >>
> >> >> >> >> >> Here is some demo code you can use in the computeEnhancements
> >> >> >> >> >> method:
> >> >> >> >> >>
> >> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
> >> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
> >> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
> >> >> >> >> >>             sections = Collections.singleton(at).iterator();
> >> >> >> >> >>         }
> >> >> >> >> >>
> >> >> >> >> >>         while(sections.hasNext()){
> >> >> >> >> >>             Section section = sections.next();
> >> >> >> >> >>             Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
> >> >> >> >> >>             while(chunks.hasNext()){
> >> >> >> >> >>                 Span chunk = chunks.next();
> >> >> >> >> >>                 Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
> >> >> >> >> >>                 //skip chunks that carry no phrase annotation
> >> >> >> >> >>                 if(phrase != null && phrase.value().getCategory() == LexicalCategory.Noun){
> >> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
> >> >> >> >> >>                         chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
> >> >> >> >> >>                 }
> >> >> >> >> >>             }
> >> >> >> >> >>         }
> >> >> >> >> >>
> >> >> >> >> >> hope this helps
> >> >> >> >> >>
> >> >> >> >> >> best
> >> >> >> >> >> Rupert
> >> >> >> >> >>
> >> >> >> >> >> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
> >> >> >> >> >>
> >> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
> >> >> >> >> >> <[email protected]> wrote:
> >> >> >> >> >> > I started to implement the engine and I'm having problems
> >> >> >> >> >> > getting results for noun phrases. I modified the "default"
> >> >> >> >> >> > weighted chain to also include the PosChunkerEngine and ran
> >> >> >> >> >> > a sample text: "Angela Merkel visited China. The German
> >> >> >> >> >> > chancellor met with various people". I expected that the RDF
> >> >> >> >> >> > XML output would contain some info about the noun phrases,
> >> >> >> >> >> > but I cannot see any.
> >> >> >> >> >> > Could you point me to the correct way to generate the noun
> >> >> >> >> >> > phrases?
> >> >> >> >> >> >
> >> >> >> >> >> > Thanks,
> >> >> >> >> >> > Cristian
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
> >> >> >> >> >> [email protected]>:
> >> >> >> >> >> >
> >> >> >> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <
> >> >> >> >> >> [email protected]>
> >> >> >> >> >> >> :
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hi Rupert,
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a look
> >> >> >> >> >> >>> at Yago.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> I will create a Jira with what we talked about here. It will
> >> >> >> >> >> >>> probably have just a draft-like description for now and will
> >> >> >> >> >> >>> be updated as I go along.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Thanks,
> >> >> >> >> >> >>> Cristian
> >> >> >> >> >> >>>
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
> >> >> >> >> >> >>> [email protected]>:
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Hi Cristian,
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> definitely an interesting approach. You should have a look at
> >> >> >> >> >> >>>> Yago2 [1]. As far as I can remember the Yago taxonomy is much
> >> >> >> >> >> >>>> better structured than the one used by dbpedia. Mapping
> >> >> >> >> >> >>>> suggestions of dbpedia to concepts in Yago2 is easy as both
> >> >> >> >> >> >>>> dbpedia and yago2 provide mappings [2] and [3].
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >> >> >> >> >> >>>> > <[email protected]>:
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's company
> >> >> >> >> >> >>>> >> made a huge profit".
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> That's actually a very good example. Spatial contexts are very
> >> >> >> >> >> >>>> important as they tend to be used often for referencing. So I
> >> >> >> >> >> >>>> would suggest treating the spatial context specially. For
> >> >> >> >> >> >>>> spatial Entities (like a City) this is easy, but even for
> >> >> >> >> >> >>>> others (like a Person or Company) you could use relations to
> >> >> >> >> >> >>>> spatial entities to define their spatial context. This context
> >> >> >> >> >> >>>> could then be used to correctly link "The Redmond's company"
> >> >> >> >> >> >>>> to "Microsoft".
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> In addition I would suggest using the "spatial" context of each
> >> >> >> >> >> >>>> entity (basically relations to entities that are cities,
> >> >> >> >> >> >>>> regions, countries) as a separate dimension, because those are
> >> >> >> >> >> >>>> very often used for coreferences.
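
The spatial dimension described above could be prototyped roughly like this. It is only a sketch under assumptions: the class and method names are hypothetical, the spatial context of each entity is a hand-filled set of lower-cased place names (in Stanbol it would have to be derived from the dereferenced dbpedia relations), and the matching is a plain token lookup.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SpatialContextSketch {

    // Lower-cases the phrase, strips a possessive "'s" and punctuation from each token.
    static Set<String> tokens(String nounPhrase) {
        Set<String> out = new LinkedHashSet<>();
        for (String raw : nounPhrase.toLowerCase(Locale.ROOT).split("\\s+")) {
            String token = raw.replaceAll("'s$", "").replaceAll("[^a-z0-9]", "");
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Returns the entities whose spatial context shares at least one token with the phrase.
    // spatialContext maps an entity label to place names (hypothetical, hand-filled data).
    static List<String> candidates(String nounPhrase, Map<String, Set<String>> spatialContext) {
        Set<String> phraseTokens = tokens(nounPhrase);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : spatialContext.entrySet()) {
            for (String place : entry.getValue()) {
                if (phraseTokens.contains(place)) {
                    matches.add(entry.getKey());
                    break; // one shared place name is enough for this sketch
                }
            }
        }
        return matches;
    }
}
```

With a context of Microsoft -> {redmond, washington}, the phrase "The Redmond's company" yields Microsoft as the only candidate, while "The software company" yields none and would have to be resolved through the category dimension instead.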
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
> >> >> >> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
> >> >> >> >> >> >>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
> >> >> >> >> >> >>>> <[email protected]> wrote:
> >> >> >> >> >> >>>> > There are several dbpedia categories for each entity; in this
> >> >> >> >> >> >>>> > case, for Microsoft, we have:
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
> >> >> >> >> >> >>>> > category:Microsoft
> >> >> >> >> >> >>>> > category:Software_companies_of_the_United_States
> >> >> >> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
> >> >> >> >> >> >>>> > category:Companies_established_in_1975
> >> >> >> >> >> >>>> > category:1975_establishments_in_the_United_States
> >> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
> >> >> >> >> >> >>>> > category:Multinational_companies_headquartered_in_the_United_States
> >> >> >> >> >> >>>> > category:Cloud_computing_providers
> >> >> >> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > So we also have "Companies based in Redmond, Washington" which
> >> >> >> >> >> >>>> > could be matched.
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > There is still other contextual information from dbpedia which
> >> >> >> >> >> >>>> > can be used. For example for an Organization we could also
> >> >> >> >> >> >>>> > include:
> >> >> >> >> >> >>>> > dbpprop:industry = Software
> >> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > and for a Person (that's for Barack Obama):
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > dbpedia-owl:profession:
> >> >> >> >> >> >>>> >     dbpedia:Author
> >> >> >> >> >> >>>> >     dbpedia:Constitutional_law
> >> >> >> >> >> >>>> >     dbpedia:Lawyer
> >> >> >> >> >> >>>> >     dbpedia:Community_organizing
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > I'd like to continue investigating this as I think that it may
> >> >> >> >> >> >>>> > have some value in increasing the number of coreference
> >> >> >> >> >> >>>> > resolutions. I'd like to concentrate more on precision than on
> >> >> >> >> >> >>>> > recall, since we already have a set of coreferences detected by
> >> >> >> >> >> >>>> > the stanford nlp tool and this would be an addition to that (at
> >> >> >> >> >> >>>> > least this is how I would like to use it).
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could update it to
> >> >> >> >> >> >>>> > show my progress and also my conclusions, and if it turns out
> >> >> >> >> >> >>>> > that it was a bad idea then at least I'll end up with more
> >> >> >> >> >> >>>> > knowledge about Stanbol in the end :).
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
> >> >> >> >> >> >>>> > <[email protected]>:
> >> >> >> >> >> >>>> >
> >> >> >> >> >> >>>> >> Hi Cristian,
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's
> >> >> >> >> >> >>>> >> advocate but I'm just not sure about the recall using the
> >> >> >> >> >> >>>> >> dbpedia categories feature. For example, your sentence could
> >> >> >> >> >> >>>> >> also be "Microsoft posted its 2013 earnings. The Redmond's
> >> >> >> >> >> >>>> >> company made a huge profit". So, maybe including more
> >> >> >> >> >> >>>> >> contextual information from dbpedia could increase the recall
> >> >> >> >> >> >>>> >> but of course will reduce the precision.
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> Cheers,
> >> >> >> >> >> >>>> >> Rafa
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>> >>> Back with a more detailed description of the steps for making
> >> >> >> >> >> >>>> >>> this kind of coreference work.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> I will be using references to the following text in the steps
> >> >> >> >> >> >>>> >>> below in order to make things clearer: "Microsoft posted its
> >> >> >> >> >> >>>> >>> 2013 earnings. The software company made a huge profit."
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> 1. For every noun phrase in the text which has:
> >> >> >> >> >> >>>> >>>      a. a determiner which implies a reference to an entity
> >> >> >> >> >> >>>> >>>      local to the text, such as "the, this, these", but not
> >> >> >> >> >> >>>> >>>      "another, every", etc., which imply a reference to an
> >> >> >> >> >> >>>> >>>      entity outside of the text.
> >> >> >> >> >> >>>> >>>      b. at least one other noun aside from the main required
> >> >> >> >> >> >>>> >>>      noun, which further describes it. For example I will not
> >> >> >> >> >> >>>> >>>      count "The company" as a legitimate candidate, since this
> >> >> >> >> >> >>>> >>>      could create a lot of false positives given the double
> >> >> >> >> >> >>>> >>>      meaning of some words, such as "in the company of good
> >> >> >> >> >> >>>> >>>      people".
> >> >> >> >> >> >>>> >>>      "The software company" is a good candidate since we also
> >> >> >> >> >> >>>> >>>      have "software".
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> 2. Match the nouns in the noun phrase to the contents of the
> >> >> >> >> >> >>>> >>> dbpedia categories of each named entity found prior to the
> >> >> >> >> >> >>>> >>> location of the noun phrase in the text.
> >> >> >> >> >> >>>> >>> The dbpedia categories are in the following format (for
> >> >> >> >> >> >>>> >>> Microsoft, for example): "Software companies of the United
> >> >> >> >> >> >>>> >>> States".
> >> >> >> >> >> >>>> >>> So we try to match "software company" with that.
> >> >> >> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia category
> >> >> >> >> >> >>>> >>> has a plural form, and it's the same for all categories which
> >> >> >> >> >> >>>> >>> I saw. I don't know if there's an easier way to do this, but I
> >> >> >> >> >> >>>> >>> thought of applying a lemmatizer on the category and the noun
> >> >> >> >> >> >>>> >>> phrase in order for them to have a common denominator. This
> >> >> >> >> >> >>>> >>> also works if the noun phrase itself has a plural form.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> Second, I'll need to use for comparison only the words in the
> >> >> >> >> >> >>>> >>> category which are themselves nouns and not prepositions or
> >> >> >> >> >> >>>> >>> determiners such as "of the". This means that I need to pos
> >> >> >> >> >> >>>> >>> tag the categories' contents as well.
> >> >> >> >> >> >>>> >>> I was thinking of running the pos and lemma on the dbpedia
> >> >> >> >> >> >>>> >>> categories when building the dbpedia backed entity hub and
> >> >> >> >> >> >>>> >>> storing them for later use. I don't know how feasible this is
> >> >> >> >> >> >>>> >>> at the moment.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> After this I can compare each noun in the noun phrase with the
> >> >> >> >> >> >>>> >>> equivalent nouns in the categories and based on the number of
> >> >> >> >> >> >>>> >>> matches I can create a confidence level.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> 3. Match the noun of the noun phrase with the rdf:type from
> >> >> >> >> >> >>>> >>> dbpedia of the named entity. If this matches, increase the
> >> >> >> >> >> >>>> >>> confidence level.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> 4. If there are multiple named entities which can match a
> >> >> >> >> >> >>>> >>> certain noun phrase, then link the noun phrase with the
> >> >> >> >> >> >>>> >>> closest named entity prior to it in the text.
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> What do you think?
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> Cristian
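
Steps 1, 2 and 4 above can be sketched roughly as follows. This is an illustration only, not the engine itself: the class, the Mention type and the method names are hypothetical, a naive plural stripper stands in for a real lemmatizer, a small stop-word list stands in for POS tagging the category labels, and step 3 (the rdf:type match) is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CategoryCorefSketch {

    /** A named entity mention: label, start offset in the text, dbpedia category labels. */
    static final class Mention {
        final String name;
        final int start;
        final List<String> categories;

        Mention(String name, int start, List<String> categories) {
            this.name = name;
            this.start = start;
            this.categories = categories;
        }
    }

    // Determiners that point to an entity already present in the text (step 1a).
    private static final Set<String> LOCAL_DETERMINERS =
            new HashSet<>(Arrays.asList("the", "this", "these"));

    // Assumption: a stop-word list stands in for POS tagging the category labels.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "this", "these", "a", "an", "of", "in", "and", "based"));

    // Assumption: naive plural stripping stands in for a real lemmatizer.
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Lemmas of all tokens that are not stop words.
    static Set<String> contentLemmas(String text) {
        Set<String> lemmas = new LinkedHashSet<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                lemmas.add(lemma(token));
            }
        }
        return lemmas;
    }

    // Step 1: local determiner plus at least two content words ("The company" is rejected).
    static boolean isCandidate(String nounPhrase) {
        String first = nounPhrase.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
        return LOCAL_DETERMINERS.contains(first) && contentLemmas(nounPhrase).size() >= 2;
    }

    // Step 2: best fraction of the phrase's content words found in a single category label.
    static double confidence(String nounPhrase, List<String> categories) {
        Set<String> phrase = contentLemmas(nounPhrase);
        if (phrase.isEmpty()) {
            return 0.0;
        }
        double best = 0.0;
        for (String category : categories) {
            Set<String> label = contentLemmas(category);
            int hits = 0;
            for (String word : phrase) {
                if (label.contains(word)) {
                    hits++;
                }
            }
            best = Math.max(best, (double) hits / phrase.size());
        }
        return best;
    }

    // Step 4: closest mention before the phrase whose confidence reaches the threshold, or null.
    static String resolve(String nounPhrase, int phraseStart, List<Mention> mentions,
                          double threshold) {
        if (!isCandidate(nounPhrase)) {
            return null;
        }
        Mention best = null;
        for (Mention m : mentions) {
            if (m.start < phraseStart
                    && confidence(nounPhrase, m.categories) >= threshold
                    && (best == null || m.start > best.start)) {
                best = m;
            }
        }
        return best == null ? null : best.name;
    }
}
```

For the sample text, a Microsoft mention at offset 0 carrying the category "Software companies of the United States" gets linked to "The software company" (offset 36) with a threshold of 0.5, while "The company" is rejected by the step 1 filter. The threshold value is an arbitrary choice for the sketch.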
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <
> >> >> [email protected]>:
> >> >> >> >> >> >>>> >>>
> >> >> >> >> >> >>>> >>>  Hi Rafa,
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>> I don't yet have a concrete heuristic but I'm working on it.
> >> >> >> >> >> >>>> >>>> I'll provide it here so that you guys can give me feedback on
> >> >> >> >> >> >>>> >>>> it.
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>> What are "locality" features?
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>> I looked at Bart and other coref tools such as ArkRef and
> >> >> >> >> >> >>>> >>>> CherryPicker and they don't provide such a coreference.
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>> Cristian
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <[email protected]>:
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>> Hi Cristian,
> >> >> >> >> >> >>>> >>>>
> >> >> >> >> >> >>>> >>>>> Without having more details about your concrete heuristic, in
> >> >> >> >> >> >>>> >>>>> my honest opinion, such an approach could produce a lot of
> >> >> >> >> >> >>>> >>>>> false positives. I don't know if you are planning to use some
> >> >> >> >> >> >>>> >>>>> "locality" features to detect such coreferences, but you need
> >> >> >> >> >> >>>> >>>>> to take into account that it is quite usual for coreferenced
> >> >> >> >> >> >>>> >>>>> mentions to occur even in different paragraphs. Although I'm
> >> >> >> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I would say
> >> >> >> >> >> >>>> >>>>> it is quite difficult to get decent precision/recall rates
> >> >> >> >> >> >>>> >>>>> for coreferencing using fixed rules. Maybe you can give a try
> >> >> >> >> >> >>>> >>>>> to other tools like BART (http://www.bart-coref.org/).
> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >>>> >>>>> Cheers,
> >> >> >> >> >> >>>> >>>>> Rafa Haro
> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >>>> >>>>>   Hi,
> >> >> >> >> >> >>>> >>>>>
> >> >> >> >> >> >>>> >>>>>> One of the necessary steps for implementing the Event
> >> >> >> >> >> >>>> >>>>>> extraction Engine feature:
> >> >> >> >> >> >>>> >>>>>> https://issues.apache.org/jira/browse/STANBOL-1121 is to have
> >> >> >> >> >> >>>> >>>>>> coreference resolution in the given text. This is provided
> >> >> >> >> >> >>>> >>>>>> now via the stanford-nlp project but as far as I saw this
> >> >> >> >> >> >>>> >>>>>> module is performing mostly pronominal (He, She) or nominal
> >> >> >> >> >> >>>> >>>>>> (Barack Obama and Mr. Obama) coreference resolution.
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>>>>> In order to get more coreferences from the text I thought of
> >> >> >> >> >> >>>> >>>>>> creating some logic that would detect this kind of
> >> >> >> >> >> >>>> >>>>>> coreference:
> >> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software company just
> >> >> >> >> >> >>>> >>>>>> announced its 2013 earnings."
> >> >> >> >> >> >>>> >>>>>> Here "The software company" obviously refers to "Apple".
> >> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named Entities which
> >> >> >> >> >> >>>> >>>>>> are of the rdf:type of the Named Entity, in this case
> >> >> >> >> >> >>>> >>>>>> "company", and also have attributes which can be found in the
> >> >> >> >> >> >>>> >>>>>> dbpedia categories of the named entity, in this case
> >> >> >> >> >> >>>> >>>>>> "software".
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>>>>> The detection of coreferences such as "The software company"
> >> >> >> >> >> >>>> >>>>>> in the text would also be done by either using the new Pos
> >> >> >> >> >> >>>> >>>>>> Tag Based Phrase extraction Engine (noun phrases) or by using
> >> >> >> >> >> >>>> >>>>>> a dependency tree of the sentence and picking up only
> >> >> >> >> >> >>>> >>>>>> subjects or objects.
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>>>>> At this point I'd like to know if this kind of logic would be
> >> >> >> >> >> >>>> >>>>>> useful as a separate Enhancement Engine (in case the
> >> >> >> >> >> >>>> >>>>>> precision and recall are good enough) in Stanbol?
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>>>>> Thanks,
> >> >> >> >> >> >>>> >>>>>> Cristian
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>>>>>
> >> >> >> >> >> >>>> >>
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>> --
> >> >> >> >> >> >>>> | Rupert Westenthaler             [email protected]
> >> >> >> >> >> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> >> >> >> >>>> | A-5500 Bischofshofen
> >> >> >> >> >> >>>>
> >> >> >> >> >> >>>
> >> >> >> >> >> >>>
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> --
> >> >> >> >> >> | Rupert Westenthaler             [email protected]
> >> >> >> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> | Rupert Westenthaler             [email protected]
> >> >> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> >> >> | A-5500 Bischofshofen
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> | Rupert Westenthaler             [email protected]
> >> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> >> | A-5500 Bischofshofen
> >> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             [email protected]
> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
> >> >>
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >
> >
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
