Hi Cristian,

you cannot send attachments to the list. Please copy the contents
directly into the mail.

thx
Rupert

On Wed, Mar 19, 2014 at 9:20 PM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
> The config attached.
>
>
> 2014-03-19 9:09 GMT+02:00 Rupert Westenthaler
> <rupert.westentha...@gmail.com>:
>
>> Hi Cristian,
>>
>> can you provide the contents of the chain after your modifications? It
>> would be interesting to test why the chain is no longer active after
>> the restart.
>>
>> You can find the config file in the 'stanbol/fileinstall' folder.
>>
>> best
>> Rupert
>>
>> On Tue, Mar 18, 2014 at 8:24 PM, Cristian Petroaca
>> <cristian.petro...@gmail.com> wrote:
>> > Related to the default chain selection rules : before the restart I had a
>> > chain named 'default', as in I could access it via enhancer/chain/default.
>> > Then I just added another engine to the 'default' chain. I assumed that
>> > after the restart the chain with the 'default' name would be persisted, so
>> > the first rule should have been applied after the restart as well. But
>> > instead I cannot reach it via enhancer/chain/default anymore, so it's gone.
>> > Anyway, this is not a big deal, it's not blocking me in any way, I just
>> > wanted to understand where the problem is.
>> >
>> >
>> > 2014-03-18 7:15 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>> >
>> >> Hi Cristian
>> >>
>> >> On Mon, Mar 17, 2014 at 9:43 PM, Cristian Petroaca
>> >> <cristian.petro...@gmail.com> wrote:
>> >> > 1. Updated to the latest code and it's gone. Cool
>> >> >
>> >> > 2. I start the stable launcher -> create a new instance of the
>> >> > PosChunkerEngine -> add it to the default chain. At this point
>> >> > everything looks good and works ok.
>> >> > After I restart the server the default chain is gone and instead I see
>> >> > this in the enhancement chains page : all-active (default, id: 149,
>> >> > ranking: 0, impl: AllActiveEnginesChain ). all-active did not contain
>> >> > the 'default' word before the restart.
>> >> >
>> >>
>> >> Please note the default chain selection rules as described at [1]. You
>> >> can also access chains under '/enhancer/chain/{chain-name}'.
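>> >> (For example, assuming the default launcher port, that would be
>> >> http://localhost:8080/enhancer/chain/default for the 'default' chain.)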
>> >>
>> >> best
>> >> Rupert
>> >>
>> >> [1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/chains/#default-chain
>> >>
>> >> > It looks like the config files are exactly what I need. Thanks.
>> >> >
>> >> >
>> >> > 2014-03-17 9:26 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>> >> >
>> >> >> On Sat, Mar 15, 2014 at 8:34 PM, Cristian Petroaca
>> >> >> <cristian.petro...@gmail.com> wrote:
>> >> >> > Thanks Rupert.
>> >> >> >
>> >> >> > A couple more questions/issues :
>> >> >> >
>> >> >> > 1. Whenever I start the stanbol server I'm seeing this in the
>> >> >> > console output :
>> >> >> >
>> >> >>
>> >> >> This should be fixed with STANBOL-1278 [1] [2]
>> >> >>
>> >> >> > 2. Whenever I restart the server the Weighted Chains get messed up. I
>> >> >> > usually use the 'default' chain and add my engine to it so there are
>> >> >> > 11 engines in it. After the restart this chain now contains around 23
>> >> >> > engines in total.
>> >> >>
>> >> >> I was not able to replicate this. What I tried was
>> >> >>
>> >> >> (1) start up the stable launcher
>> >> >> (2) add an additional engine to the default chain
>> >> >> (3) restart the launcher
>> >> >>
>> >> >> The default chain was not changed after (2) and (3), so I would need
>> >> >> further information to understand why this is happening.
>> >> >>
>> >> >> Generally it is better to create your own chain instance rather than
>> >> >> modifying one that is provided by the default configuration. I would
>> >> >> also recommend that you keep your test configuration in text files and
>> >> >> copy those to the 'stanbol/fileinstall' folder. Doing so saves you from
>> >> >> manually re-entering the configuration after a software update. The
>> >> >> production-mode section [3] provides information on how to do that.
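>> >> >>
>> >> >> As a rough illustration of such a text file: the fileinstall folder
>> >> >> takes standard OSGi configurations in the Felix ".config" format. A
>> >> >> minimal sketch for a weighted chain could look roughly like the
>> >> >> following (the file name, property names and engine names here are
>> >> >> from memory and only meant as an example, so please double check them
>> >> >> against a configuration exported from a running instance):
>> >> >>
>> >> >>     # stanbol/fileinstall/org.apache.stanbol.enhancer.chain.weighted.impl.WeightedChain-myChain.config
>> >> >>     stanbol.enhancer.chain.name="my-chain"
>> >> >>     stanbol.enhancer.chain.weighted.chain=["langdetect","opennlp-sentence","opennlp-token","opennlp-pos","opennlp-chunker"]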
>> >> >>
>> >> >> best
>> >> >> Rupert
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/STANBOL-1278
>> >> >> [2] http://svn.apache.org/r1576623
>> >> >> [3] http://stanbol.apache.org/docs/trunk/production-mode
>> >> >>
>> >> >> > ERROR: Bundle org.apache.stanbol.enhancer.engine.topic.web [153]: Error starting
>> >> >> > slinginstall:c:\Data\Projects\Stanbol\main\launchers\stable\target\stanbol\startup\35\org.apache.stanbol.enhancer.engine.topic.web-1.0.0-SNAPSHOT.jar
>> >> >> > (org.osgi.framework.BundleException: Unresolved constraint in bundle
>> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0))))
>> >> >> > org.osgi.framework.BundleException: Unresolved constraint in bundle
>> >> >> > org.apache.stanbol.enhancer.engine.topic.web [153]: Unable to resolve 153.0: missing
>> >> >> > requirement [153.0] package; (&(package=javax.ws.rs)(version>=0.0.0)(!(version>=2.0.0)))
>> >> >> >         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>> >> >> >         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>> >> >> >         at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1156)
>> >> >> >         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
>> >> >> >         at java.lang.Thread.run(Unknown Source)
>> >> >> >
>> >> >> > Despite this the server starts fine and I can use the enhancer
>> >> >> > without problems. Do you guys see this as well?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > 2014-03-11 9:47 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>> >> >> >
>> >> >> >> Hi Cristian,
>> >> >> >>
>> >> >> >> NER Annotations are typically available as both
>> >> >> >> NlpAnnotations.NER_ANNOTATION and fise:TextAnnotation [1] in the
>> >> >> >> enhancement metadata. As you are already accessing the AnalysedText
>> >> >> >> I would prefer using the NlpAnnotations.NER_ANNOTATION.
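>> >> >> >>
>> >> >> >> A rough, untested sketch of how that could look, analogous to the
>> >> >> >> NounPhrase demo code quoted below (it assumes NER chunks carry a
>> >> >> >> NlpAnnotations.NER_ANNOTATION with a NerTag value; please verify the
>> >> >> >> names against the current NLP API before relying on it):
>> >> >> >>
>> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>> >> >> >>         Iterator<Span> spans = at.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >> >> >>         while(spans.hasNext()){
>> >> >> >>             Span span = spans.next();
>> >> >> >>             Value<NerTag> ner = span.getAnnotation(NlpAnnotations.NER_ANNOTATION);
>> >> >> >>             if(ner != null){ //only chunks that represent named entities
>> >> >> >>                 log.info(" - NamedEntity [{},{}] {} (type: {})", new Object[]{
>> >> >> >>                         span.getStart(), span.getEnd(), span.getSpan(),
>> >> >> >>                         ner.value().getType()});
>> >> >> >>             }
>> >> >> >>         }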
>> >> >> >>
>> >> >> >> best
>> >> >> >> Rupert
>> >> >> >>
>> >> >> >> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
>> >> >> >>
>> >> >> >> On Mon, Mar 10, 2014 at 10:07 PM, Cristian Petroaca
>> >> >> >> <cristian.petro...@gmail.com> wrote:
>> >> >> >> > Thanks.
>> >> >> >> > I assume I should get the Named Entities using the same approach but
>> >> >> >> > with NlpAnnotations.NER_ANNOTATION?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > 2014-03-10 13:29 GMT+02:00 Rupert Westenthaler <
>> >> >> >> > rupert.westentha...@gmail.com>:
>> >> >> >> >
>> >> >> >> >> Hallo Cristian,
>> >> >> >> >>
>> >> >> >> >> NounPhrases are not added to the RDF enhancement results. You need
>> >> >> >> >> to use the AnalyzedText ContentPart [1].
>> >> >> >> >>
>> >> >> >> >> Here is some demo code you can use in the computeEnhancement method:
>> >> >> >> >>
>> >> >> >> >>         AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
>> >> >> >> >>         Iterator<? extends Section> sections = at.getSentences();
>> >> >> >> >>         if(!sections.hasNext()){ //process as single sentence
>> >> >> >> >>             sections = Collections.singleton(at).iterator();
>> >> >> >> >>         }
>> >> >> >> >>
>> >> >> >> >>         while(sections.hasNext()){
>> >> >> >> >>             Section section = sections.next();
>> >> >> >> >>             Iterator<Span> chunks =
>> >> >> >> >>                 section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
>> >> >> >> >>             while(chunks.hasNext()){
>> >> >> >> >>                 Span chunk = chunks.next();
>> >> >> >> >>                 Value<PhraseTag> phrase =
>> >> >> >> >>                     chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
>> >> >> >> >>                 //not every chunk has a phrase annotation, so check for null
>> >> >> >> >>                 if(phrase != null && phrase.value().getCategory() == LexicalCategory.Noun){
>> >> >> >> >>                     log.info(" - NounPhrase [{},{}] {}", new Object[]{
>> >> >> >> >>                             chunk.getStart(),chunk.getEnd(),chunk.getSpan()});
>> >> >> >> >>                 }
>> >> >> >> >>             }
>> >> >> >> >>         }
>> >> >> >> >>
>> >> >> >> >> hope this helps
>> >> >> >> >>
>> >> >> >> >> best
>> >> >> >> >> Rupert
>> >> >> >> >>
>> >> >> >> >> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext
>> >> >> >> >>
>> >> >> >> >> On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca
>> >> >> >> >> <cristian.petro...@gmail.com> wrote:
>> >> >> >> >> > I started to implement the engine and I'm having problems with
>> >> >> >> >> > getting results for noun phrases. I modified the "default" weighted
>> >> >> >> >> > chain to also include the PosChunkerEngine and ran a sample text :
>> >> >> >> >> > "Angela Merkel visited China. The German chancellor met with various
>> >> >> >> >> > people". I expected that the RDF XML output would contain some info
>> >> >> >> >> > about the noun phrases but I cannot see any.
>> >> >> >> >> > Could you point me to the correct way to generate the noun phrases?
>> >> >> >> >> >
>> >> >> >> >> > Thanks,
>> >> >> >> >> > Cristian
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <
>> >> >> >> >> cristian.petro...@gmail.com>:
>> >> >> >> >> >
>> >> >> >> >> >> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>> >> >> >> >> >>
>> >> >> >> >> >> Hi Rupert,
>> >> >> >> >> >>>
>> >> >> >> >> >>> The "spatial" dimension is a good idea. I'll also take a
>> >> >> >> >> >>> look
>> >> at
>> >> >> >> Yago.
>> >> >> >> >> >>>
>> >> >> >> >> >>> I will create a Jira with what we talked about here. It will
>> >> >> >> >> >>> probably have just a draft-like description for now and will be
>> >> >> >> >> >>> updated as I go along.
>> >> >> >> >> >>>
>> >> >> >> >> >>> Thanks,
>> >> >> >> >> >>> Cristian
>> >> >> >> >> >>>
>> >> >> >> >> >>>
>> >> >> >> >> >>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <
>> >> >> >> >> >>> rupert.westentha...@gmail.com>:
>> >> >> >> >> >>>
>> >> >> >> >> >>> Hi Cristian,
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> definitely an interesting approach. You should have a look at
>> >> >> >> >> >>>> Yago2 [1]. As far as I can remember the Yago taxonomy is much
>> >> >> >> >> >>>> better structured than the one used by dbpedia. Mapping
>> >> >> >> >> >>>> suggestions of dbpedia to concepts in Yago2 is easy as both
>> >> >> >> >> >>>> dbpedia and Yago2 provide mappings [2] and [3].
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >> >> >> >> >>>> > <rh...@apache.org>:
>> >> >> >> >> >>>> >>
>> >> >> >> >> >>>> >> "Microsoft posted its 2013 earnings. The Redmond's
>> >> >> >> >> >>>> >> company
>> >> >> made
>> >> >> >> a
>> >> >> >> >> >>>> >> huge profit".
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> That's actually a very good example. Spatial contexts are very
>> >> >> >> >> >>>> important as they tend to be often used for referencing. So I
>> >> >> >> >> >>>> would suggest to treat the spatial context specially. For spatial
>> >> >> >> >> >>>> entities (like a City) this is easy, but even for others (like a
>> >> >> >> >> >>>> Person or Company) you could use relations to spatial entities to
>> >> >> >> >> >>>> define their spatial context. This context could then be used to
>> >> >> >> >> >>>> correctly link "The Redmond's company" to "Microsoft".
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> In addition I would suggest to use the "spatial" context of each
>> >> >> >> >> >>>> entity (basically relations to entities that are cities, regions or
>> >> >> >> >> >>>> countries) as a separate dimension, because those are very often
>> >> >> >> >> >>>> used for coreferences.
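>> >> >> >> >> >>>>
>> >> >> >> >> >>>> A purely illustrative sketch of that idea (the names are made up
>> >> >> >> >> >>>> for this mail and are not existing Stanbol API); it simply treats
>> >> >> >> >> >>>> the spatial labels related to an entity as one more feature set to
>> >> >> >> >> >>>> match the noun phrase against:
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>     boolean spatialContextMatches(Set<String> phraseTokens,         //e.g. {"redmond", "company"}
>> >> >> >> >> >>>>                                   Set<String> entitySpatialLabels){ //e.g. {"redmond", "washington"}
>> >> >> >> >> >>>>         for(String token : phraseTokens){
>> >> >> >> >> >>>>             if(entitySpatialLabels.contains(token)){
>> >> >> >> >> >>>>                 return true; //the phrase mentions a place related to the entity
>> >> >> >> >> >>>>             }
>> >> >> >> >> >>>>         }
>> >> >> >> >> >>>>         return false;
>> >> >> >> >> >>>>     }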
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>> >> >> >> >> >>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>> >> >> >> >> >>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>> >> >> >> >> >>>> <cristian.petro...@gmail.com> wrote:
>> >> >> >> >> >>>> > There are several dbpedia categories for each entity, in this
>> >> >> >> >> >>>> > case for Microsoft we have :
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > category:Companies_in_the_NASDAQ-100_Index
>> >> >> >> >> >>>> > category:Microsoft
>> >> >> >> >> >>>> > category:Software_companies_of_the_United_States
>> >> >> >> >> >>>> > category:Software_companies_based_in_Washington_(state)
>> >> >> >> >> >>>> > category:Companies_established_in_1975
>> >> >> >> >> >>>> > category:1975_establishments_in_the_United_States
>> >> >> >> >> >>>> > category:Companies_based_in_Redmond,_Washington
>> >> >> >> >> >>>> > category:Multinational_companies_headquartered_in_the_United_States
>> >> >> >> >> >>>> > category:Cloud_computing_providers
>> >> >> >> >> >>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > So we also have "Companies based in Redmond, Washington" which
>> >> >> >> >> >>>> > could be matched.
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > There is still other contextual information from dbpedia which
>> >> >> >> >> >>>> > can be used.
>> >> >> >> >> >>>> > For example for an Organization we could also include :
>> >> >> >> >> >>>> > dbpprop:industry = Software
>> >> >> >> >> >>>> > dbpprop:service = Online Service Providers
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > and for a Person (that's for Barack Obama) :
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > dbpedia-owl:profession :
>> >> >> >> >> >>>> >     dbpedia:Author
>> >> >> >> >> >>>> >     dbpedia:Constitutional_law
>> >> >> >> >> >>>> >     dbpedia:Lawyer
>> >> >> >> >> >>>> >     dbpedia:Community_organizing
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > I'd like to continue investigating this as I think it may have
>> >> >> >> >> >>>> > some value in increasing the number of coreference resolutions.
>> >> >> >> >> >>>> > I'd like to concentrate more on precision rather than recall,
>> >> >> >> >> >>>> > since we already have a set of coreferences detected by the
>> >> >> >> >> >>>> > stanford nlp tool and this would be an addition to that (at
>> >> >> >> >> >>>> > least this is how I would like to use it).
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > Is it ok if I track this by opening a jira? I could update it to
>> >> >> >> >> >>>> > show my progress and also my conclusions. If it turns out that
>> >> >> >> >> >>>> > it was a bad idea, then that's the situation; at least I'll end
>> >> >> >> >> >>>> > up with more knowledge about Stanbol in the end :).
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro
>> >> >> >> >> >>>> > <rh...@apache.org>:
>> >> >> >> >> >>>> >
>> >> >> >> >> >>>> >> Hi Cristian,
>> >> >> >> >> >>>> >>
>> >> >> >> >> >>>> >> The approach sounds nice. I don't want to be the devil's
>> >> >> >> >> >>>> >> advocate, but I'm just not sure about the recall when using the
>> >> >> >> >> >>>> >> dbpedia categories feature. For example, your sentence could
>> >> >> >> >> >>>> >> also be "Microsoft posted its 2013 earnings. The Redmond's
>> >> >> >> >> >>>> >> company made a huge profit". So maybe including more contextual
>> >> >> >> >> >>>> >> information from dbpedia could increase the recall but of course
>> >> >> >> >> >>>> >> would reduce the precision.
>> >> >> >> >> >>>> >>
>> >> >> >> >> >>>> >> Cheers,
>> >> >> >> >> >>>> >> Rafa
>> >> >> >> >> >>>> >>
>> >> >> >> >> >>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
>> >> >> >> >> >>>> >>
>> >> >> >> >> >>>> >>> Back with a more detailed description of the steps for making
>> >> >> >> >> >>>> >>> this kind of coreference work.
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> I will be using references to the following text in the steps
>> >> >> >> >> >>>> >>> below in order to make things clearer : "Microsoft posted its
>> >> >> >> >> >>>> >>> 2013 earnings. The software company made a huge profit."
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> 1. For every noun phrase in the text which has :
>> >> >> >> >> >>>> >>>      a. a determiner which implies a reference to an entity
>> >> >> >> >> >>>> >>> local to the text (such as "the, this, these") but not
>> >> >> >> >> >>>> >>> "another", "every", etc., which imply a reference to an entity
>> >> >> >> >> >>>> >>> outside of the text.
>> >> >> >> >> >>>> >>>      b. at least another noun aside from the main required noun
>> >> >> >> >> >>>> >>> which further describes it. For example I will not count "The
>> >> >> >> >> >>>> >>> company" as being a legitimate candidate since this could
>> >> >> >> >> >>>> >>> create a lot of false positives by considering the double
>> >> >> >> >> >>>> >>> meaning of some words such as "in the company of good people".
>> >> >> >> >> >>>> >>> "The software company" is a good candidate since we also have
>> >> >> >> >> >>>> >>> "software".
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> 2. match the nouns in the noun phrase to the contents of the
>> >> >> >> >> >>>> >>> dbpedia categories of each named entity found prior to the
>> >> >> >> >> >>>> >>> location of the noun phrase in the text.
>> >> >> >> >> >>>> >>> The dbpedia categories are in the following format (for
>> >> >> >> >> >>>> >>> Microsoft for example) : "Software companies of the United
>> >> >> >> >> >>>> >>> States". So we try to match "software company" with that.
>> >> >> >> >> >>>> >>> First, as you can see, the main noun in the dbpedia category
>> >> >> >> >> >>>> >>> has a plural form and it's the same for all categories I have
>> >> >> >> >> >>>> >>> seen. I don't know if there's an easier way to do this but I
>> >> >> >> >> >>>> >>> thought of applying a lemmatizer on the category and the noun
>> >> >> >> >> >>>> >>> phrase in order for them to have a common denominator. This
>> >> >> >> >> >>>> >>> also works if the noun phrase itself has a plural form.
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> Second, I'll need to use for comparison only the words in the
>> >> >> >> >> >>>> >>> category which are themselves nouns and not prepositions or
>> >> >> >> >> >>>> >>> determiners such as "of the". This means that I need to pos tag
>> >> >> >> >> >>>> >>> the categories' contents as well. I was thinking of running the
>> >> >> >> >> >>>> >>> pos and lemma on the dbpedia categories when building the
>> >> >> >> >> >>>> >>> dbpedia backed entity hub and storing them for later use - I
>> >> >> >> >> >>>> >>> don't know how feasible this is at the moment.
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> After this I can compare each noun in the noun phrase with the
>> >> >> >> >> >>>> >>> equivalent nouns in the categories and, based on the number of
>> >> >> >> >> >>>> >>> matches, create a confidence level (a rough sketch of this
>> >> >> >> >> >>>> >>> matching is given after step 4 below).
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> 3. match the noun of the noun phrase with the rdf:type from
>> >> >> >> >> >>>> >>> dbpedia of the named entity. If this matches, increase the
>> >> >> >> >> >>>> >>> confidence level.
>> >> level.
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> 4. If there are multiple named entities which can match a
>> >> >> >> >> >>>> >>> certain noun phrase, then link the noun phrase with the closest
>> >> >> >> >> >>>> >>> named entity prior to it in the text.
>> >> >> >> >> >>>> >>>
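>> >> >> >> >> >>>> >>> A rough, purely illustrative sketch of the matching in steps 2
>> >> >> >> >> >>>> >>> and 3 (all names are made up for this mail and are not existing
>> >> >> >> >> >>>> >>> Stanbol API; lemmatization and POS filtering of the category
>> >> >> >> >> >>>> >>> labels are assumed to have happened already):
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>>     double scoreCandidate(List<String> phraseNounLemmas,  //e.g. ["software", "company"]
>> >> >> >> >> >>>> >>>                           String phraseHeadLemma,         //e.g. "company"
>> >> >> >> >> >>>> >>>                           Set<String> categoryNounLemmas, //nouns from the entity's dbpedia categories
>> >> >> >> >> >>>> >>>                           Set<String> typeLabels){        //labels of the entity's rdf:type(s)
>> >> >> >> >> >>>> >>>         int matches = 0;
>> >> >> >> >> >>>> >>>         for(String lemma : phraseNounLemmas){
>> >> >> >> >> >>>> >>>             if(categoryNounLemmas.contains(lemma)){
>> >> >> >> >> >>>> >>>                 matches++; //step 2: overlap with the category nouns
>> >> >> >> >> >>>> >>>             }
>> >> >> >> >> >>>> >>>         }
>> >> >> >> >> >>>> >>>         double confidence = matches / (double) phraseNounLemmas.size();
>> >> >> >> >> >>>> >>>         if(typeLabels.contains(phraseHeadLemma)){
>> >> >> >> >> >>>> >>>             confidence = Math.min(1.0, confidence + 0.2); //step 3: rdf:type match
>> >> >> >> >> >>>> >>>         }
>> >> >> >> >> >>>> >>>         return confidence;
>> >> >> >> >> >>>> >>>     }
>> >> >> >> >> >>>> >>>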
>> >> >> >> >> >>>> >>> What do you think?
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> Cristian
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>> >> >> >> >> >>>> >>>
>> >> >> >> >> >>>> >>>  Hi Rafa,
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>> I don't yet have a concrete heuristic but I'm working on it.
>> >> >> >> >> >>>> >>>> I'll provide it here so that you guys can give me feedback on
>> >> >> >> >> >>>> >>>> it.
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>> What are "locality" features?
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>> I looked at BART and other coref tools such as ArkRef and
>> >> >> >> >> >>>> >>>> CherryPicker and they don't provide such a coreference.
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>> Cristian
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>> Hi Cristian,
>> >> >> >> >> >>>> >>>>
>> >> >> >> >> >>>> >>>>> Without having more details about your concrete heuristic,
>> >> >> >> >> >>>> >>>>> in my honest opinion such an approach could produce a lot of
>> >> >> >> >> >>>> >>>>> false positives. I don't know if you are planning to use some
>> >> >> >> >> >>>> >>>>> "locality" features to detect such coreferences, but you need
>> >> >> >> >> >>>> >>>>> to take into account that it is quite usual that coreferenced
>> >> >> >> >> >>>> >>>>> mentions can occur even in different paragraphs. Although I'm
>> >> >> >> >> >>>> >>>>> not an expert in Natural Language Understanding, I would say
>> >> >> >> >> >>>> >>>>> it is quite difficult to get decent precision/recall rates for
>> >> >> >> >> >>>> >>>>> coreferencing using fixed rules. Maybe you can give a try to
>> >> >> >> >> >>>> >>>>> other tools like BART (http://www.bart-coref.org/).
>> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >>>> >>>>> Cheers,
>> >> >> >> >> >>>> >>>>> Rafa Haro
>> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >>>> >>>>>   Hi,
>> >> >> >> >> >>>> >>>>>
>> >> >> >> >> >>>> >>>>>> One of the necessary steps for implementing the Event
>> >> >> >> >> >>>> >>>>>> extraction Engine feature
>> >> >> >> >> >>>> >>>>>> (https://issues.apache.org/jira/browse/STANBOL-1121) is to
>> >> >> >> >> >>>> >>>>>> have coreference resolution in the given text. This is
>> >> >> >> >> >>>> >>>>>> provided now via the stanford-nlp project but as far as I saw
>> >> >> >> >> >>>> >>>>>> this module is performing mostly pronominal (He, She) or
>> >> >> >> >> >>>> >>>>>> nominal (Barack Obama and Mr. Obama) coreference resolution.
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>>>>> In order to get more coreferences from the text I thought of
>> >> >> >> >> >>>> >>>>>> creating some logic that would detect this kind of
>> >> >> >> >> >>>> >>>>>> coreference :
>> >> >> >> >> >>>> >>>>>> "Apple reaches new profit heights. The software company just
>> >> >> >> >> >>>> >>>>>> announced its 2013 earnings."
>> >> >> >> >> >>>> >>>>>> Here "The software company" obviously refers to
>> >> "Apple".
>> >> >> >> >> >>>> >>>>>> So I'd like to detect coreferences of Named
>> >> >> >> >> >>>> >>>>>> Entities
>> >> >> which
>> >> >> >> are
>> >> >> >> >> of
>> >> >> >> >> >>>> the
>> >> >> >> >> >>>> >>>>>> rdf:type of the Named Entity , in this case
>> >> >> >> >> >>>> >>>>>> "company"
>> >> and
>> >> >> >> also
>> >> >> >> >> >>>> have
>> >> >> >> >> >>>> >>>>>> attributes which can be found in the dbpedia
>> >> categories
>> >> >> of
>> >> >> >> the
>> >> >> >> >> >>>> named
>> >> >> >> >> >>>> >>>>>> entity, in this case "software".
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>>>>> The detection of coreferences such as "The software company"
>> >> >> >> >> >>>> >>>>>> in the text would also be done by either using the new Pos
>> >> >> >> >> >>>> >>>>>> Tag Based Phrase extraction Engine (noun phrases) or by using
>> >> >> >> >> >>>> >>>>>> a dependency tree of the sentence and picking up only
>> >> >> >> >> >>>> >>>>>> subjects or objects.
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>>>>> At this point I'd like to know if this kind of logic would
>> >> >> >> >> >>>> >>>>>> be useful as a separate Enhancement Engine (in case the
>> >> >> >> >> >>>> >>>>>> precision and recall are good enough) in Stanbol?
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>>>>> Thanks,
>> >> >> >> >> >>>> >>>>>> Cristian
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>>>>>
>> >> >> >> >> >>>> >>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>>
>> >> >> >> >> >>>> --
>> >> >> >> >> >>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> >> >> >> >> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> >> >> >>>> | A-5500 Bischofshofen
>> >> >> >> >> >>>>
>> >> >> >> >> >>>
>> >> >> >> >> >>>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> --
>> >> >> >> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> >> >> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> >> >> | A-5500 Bischofshofen
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> >> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> >> | A-5500 Bischofshofen
>> >> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> >> | A-5500 Bischofshofen
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>
>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
