Hello Cristian,

NounPhrases are not added to the RDF enhancement results. You need to use the AnalysedText ContentPart [1].
Here is some demo code you can use in the computeEnhancement method:

    AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
    Iterator<? extends Section> sections = at.getSentences();
    if (!sections.hasNext()) {
        // no sentence annotations: process the whole text as a single sentence
        sections = Collections.singleton(at).iterator();
    }
    while (sections.hasNext()) {
        Section section = sections.next();
        // iterate over the chunks (phrases) enclosed by this section
        Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
        while (chunks.hasNext()) {
            Span chunk = chunks.next();
            Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
            // skip chunks without a phrase annotation to avoid a NullPointerException
            if (phrase != null && phrase.value().getCategory() == LexicalCategory.Noun) {
                log.info(" - NounPhrase [{},{}] {}", new Object[]{
                        chunk.getStart(), chunk.getEnd(), chunk.getSpan()});
            }
        }
    }

Hope this helps.

Best,
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext

On Sun, Mar 9, 2014 at 6:07 PM, Cristian Petroaca <cristian.petro...@gmail.com> wrote:
> I started to implement the engine and I'm having problems with getting
> results for noun phrases. I modified the "default" weighted chain to also
> include the PosChunkerEngine and ran a sample text: "Angela Merkel visted
> China. The german chancellor met with various people". I expected that the
> RDF XML output would contain some info about the noun phrases but I cannot
> see any.
> Could you point me to the correct way to generate the noun phrases?
>
> Thanks,
> Cristian
>
>
> 2014-02-09 14:15 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>
>> Opened https://issues.apache.org/jira/browse/STANBOL-1279
>>
>>
>> 2014-02-07 10:53 GMT+02:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>>
>>> Hi Rupert,
>>>
>>> The "spatial" dimension is a good idea. I'll also take a look at Yago.
>>>
>>> I will create a Jira with what we talked about here. It will probably
>>> have just a draft-like description for now and will be updated as I go
>>> along.
>>>
>>> Thanks,
>>> Cristian
>>>
>>>
>>> 2014-02-06 15:39 GMT+02:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>>>
>>>> Hi Cristian,
>>>>
>>>> definitely an interesting approach. You should have a look at Yago2
>>>> [1]. As far as I can remember the Yago taxonomy is much better
>>>> structured than the one used by dbpedia. Mapping suggestions of dbpedia
>>>> to concepts in Yago2 is easy as both dbpedia and yago2 do provide
>>>> mappings [2] and [3].
>>>>
>>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>>> >>
>>>> >> "Microsoft posted its 2013 earnings. The Redmond's company made a
>>>> >> huge profit".
>>>>
>>>> That's actually a very good example. Spatial contexts are very
>>>> important as they tend to be often used for referencing. So I would
>>>> suggest to specially treat the spatial context. For spatial Entities
>>>> (like a City) this is easy, but even for others (like a Person,
>>>> Company) you could use relations to spatial entities to define their
>>>> spatial context. This context could then be used to correctly link
>>>> "The Redmond's company" to "Microsoft".
>>>>
>>>> In addition I would suggest to use the "spatial" context of each
>>>> entity (basically relations to entities that are cities, regions,
>>>> countries) as a separate dimension, because those are very often used
>>>> for coreferences.
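
The "spatial context as a separate dimension" suggestion quoted above could be sketched roughly as follows. This is only an illustration in plain Java, not Stanbol code: the class name SpatialContextSketch, the method linkBySpatialContext and the hard-coded place names are all hypothetical, and a real engine would presumably derive the spatial context from dbpedia or Yago relations rather than from a hand-written map.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: link a noun phrase to an entity through place names. */
class SpatialContextSketch {

    /**
     * Returns the entity whose spatial context shares the most tokens with
     * the noun phrase, or null if no place name occurs in the phrase.
     * Limitation: only single-word place names can match.
     */
    static String linkBySpatialContext(String nounPhrase,
                                       Map<String, Set<String>> spatialContext) {
        Set<String> tokens = new HashSet<>();
        for (String t : nounPhrase.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (t.length() > 1) {
                tokens.add(t); // drops empty strings and the possessive "s"
            }
        }
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : spatialContext.entrySet()) {
            int hits = 0;
            for (String place : entry.getValue()) {
                if (tokens.contains(place.toLowerCase(Locale.ROOT))) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                best = entry.getKey();
                bestHits = hits;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative data only; entity names map to related places.
        Map<String, Set<String>> context = new LinkedHashMap<>();
        context.put("Microsoft", new HashSet<>(Arrays.asList("Redmond", "Washington")));
        context.put("Apple", new HashSet<>(Arrays.asList("Cupertino", "California")));
        System.out.println(linkBySpatialContext("The Redmond's company", context));
    }
}
```

Keeping the place names in their own map, separate from categories or types, mirrors the idea of scoring the spatial dimension independently of the other context features.
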
>>>>
>>>> [1] http://www.mpi-inf.mpg.de/yago-naga/yago/
>>>> [2] http://downloads.dbpedia.org/3.9/links/yago_links.nt.bz2
>>>> [3] http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoDBpediaInstances.ttl.7z
>>>>
>>>>
>>>> On Thu, Feb 6, 2014 at 10:33 AM, Cristian Petroaca
>>>> <cristian.petro...@gmail.com> wrote:
>>>> > There are several dbpedia categories for each entity, in this case for
>>>> > Microsoft we have:
>>>> >
>>>> > category:Companies_in_the_NASDAQ-100_Index
>>>> > category:Microsoft
>>>> > category:Software_companies_of_the_United_States
>>>> > category:Software_companies_based_in_Washington_(state)
>>>> > category:Companies_established_in_1975
>>>> > category:1975_establishments_in_the_United_States
>>>> > category:Companies_based_in_Redmond,_Washington
>>>> > category:Multinational_companies_headquartered_in_the_United_States
>>>> > category:Cloud_computing_providers
>>>> > category:Companies_in_the_Dow_Jones_Industrial_Average
>>>> >
>>>> > So we also have "Companies based in Redmond, Washington" which could be
>>>> > matched.
>>>> >
>>>> >
>>>> > There is still other contextual information from dbpedia which can be used.
>>>> > For example for an Organization we could also include:
>>>> > dbpprop:industry = Software
>>>> > dbpprop:service = Online Service Providers
>>>> >
>>>> > and for a Person (that's for Barack Obama):
>>>> >
>>>> > dbpedia-owl:profession:
>>>> > dbpedia:Author
>>>> > dbpedia:Constitutional_law
>>>> > dbpedia:Lawyer
>>>> > dbpedia:Community_organizing
>>>> >
>>>> > I'd like to continue investigating this as I think that it may have some
>>>> > value in increasing the number of coreference resolutions and I'd like to
>>>> > concentrate more on precision rather than recall since we already have a
>>>> > set of coreferences detected by the stanford nlp tool and this would be
>>>> > an addition to that (at least this is how I would like to use it).
>>>> >
>>>> > Is it ok if I track this by opening a jira? I could update it to show my
>>>> > progress and also my conclusions and if it turns out that it was a bad idea
>>>> > then that's the situation; at least I'll end up with more knowledge about
>>>> > Stanbol in the end :).
>>>> >
>>>> >
>>>> > 2014-02-05 15:39 GMT+02:00 Rafa Haro <rh...@apache.org>:
>>>> >
>>>> >> Hi Cristian,
>>>> >>
>>>> >> The approach sounds nice. I don't want to be the devil's advocate but I'm
>>>> >> just not sure about the recall using the dbpedia categories feature. For
>>>> >> example, your sentence could be also "Microsoft posted its 2013 earnings.
>>>> >> The Redmond's company made a huge profit". So, maybe including more
>>>> >> contextual information from dbpedia could increase the recall but of course
>>>> >> will reduce the precision.
>>>> >>
>>>> >> Cheers,
>>>> >> Rafa
>>>> >>
>>>> >> On 04/02/14 09:50, Cristian Petroaca wrote:
>>>> >>
>>>> >>> Back with a more detailed description of the steps for making this kind of
>>>> >>> coreference work.
>>>> >>>
>>>> >>> I will be using references to the following text in the steps below in
>>>> >>> order to make things clearer: "Microsoft posted its 2013 earnings. The
>>>> >>> software company made a huge profit."
>>>> >>>
>>>> >>> 1. For every noun phrase in the text which has:
>>>> >>>    a. a determiner POS which implies reference to an entity local to the
>>>> >>>       text (such as "the", "this", "these") but not "another", "every", etc.,
>>>> >>>       which imply a reference to an entity outside of the text.
>>>> >>>    b. at least one other noun aside from the main required noun which
>>>> >>>       further describes it.
>>>> >>>       For example I will not count "The company" as being a
>>>> >>>       legitimate candidate since this could create a lot of false positives by
>>>> >>>       considering the double meaning of some words such as "in the company of
>>>> >>>       good people".
>>>> >>>       "The software company" is a good candidate since we also have "software".
>>>> >>>
>>>> >>> 2. Match the nouns in the noun phrase to the contents of the dbpedia
>>>> >>> categories of each named entity found prior to the location of the noun
>>>> >>> phrase in the text.
>>>> >>> The dbpedia categories are in the following format (for Microsoft for
>>>> >>> example): "Software companies of the United States".
>>>> >>> So we try to match "software company" with that.
>>>> >>> First, as you can see, the main noun in the dbpedia category has a plural
>>>> >>> form and it's the same for all categories which I saw. I don't know if
>>>> >>> there's an easier way to do this but I thought of applying a lemmatizer on
>>>> >>> the category and the noun phrase in order for them to have a common
>>>> >>> denominator. This also works if the noun phrase itself has a plural form.
>>>> >>>
>>>> >>> Second, I'll need to use for comparison only the words in the category
>>>> >>> which are themselves nouns and not prepositions or determiners such as "of
>>>> >>> the". This means that I need to pos tag the categories contents as well.
>>>> >>> I was thinking of running the pos and lemma on the dbpedia categories when
>>>> >>> building the dbpedia backed entity hub and storing them for later use - I
>>>> >>> don't know how feasible this is at the moment.
>>>> >>>
>>>> >>> After this I can compare each noun in the noun phrase with the equivalent
>>>> >>> nouns in the categories and based on the number of matches I can create a
>>>> >>> confidence level.
>>>> >>>
>>>> >>> 3. Match the noun of the noun phrase with the rdf:type from dbpedia of the
>>>> >>> named entity. If this matches increase the confidence level.
>>>> >>>
>>>> >>> 4. If there are multiple named entities which can match a certain noun
>>>> >>> phrase then link the noun phrase with the closest named entity prior to it
>>>> >>> in the text.
>>>> >>>
>>>> >>> What do you think?
>>>> >>>
>>>> >>> Cristian
>>>> >>>
>>>> >>> 2014-01-31 Cristian Petroaca <cristian.petro...@gmail.com>:
>>>> >>>
>>>> >>>> Hi Rafa,
>>>> >>>>
>>>> >>>> I don't yet have a concrete heuristic but I'm working on it. I'll provide
>>>> >>>> it here so that you guys can give me feedback on it.
>>>> >>>>
>>>> >>>> What are "locality" features?
>>>> >>>>
>>>> >>>> I looked at Bart and other coref tools such as ArkRef and CherryPicker
>>>> >>>> and they don't provide such a coreference.
>>>> >>>>
>>>> >>>> Cristian
>>>> >>>>
>>>> >>>>
>>>> >>>> 2014-01-30 Rafa Haro <rh...@apache.org>:
>>>> >>>>
>>>> >>>>> Hi Cristian,
>>>> >>>>>
>>>> >>>>> Without having more details about your concrete heuristic, in my honest
>>>> >>>>> opinion, such an approach could produce a lot of false positives. I don't know
>>>> >>>>> if you are planning to use some "locality" features to detect such
>>>> >>>>> coreferences but you need to take into account that it is quite usual that
>>>> >>>>> coreferenced mentions occur even in different paragraphs. Although I'm
>>>> >>>>> not an expert in Natural Language Understanding, I would say it is quite
>>>> >>>>> difficult to get decent precision/recall rates for coreferencing using
>>>> >>>>> fixed rules. Maybe you can give a try to other tools like BART
>>>> >>>>> (http://www.bart-coref.org/).
>>>> >>>>>
>>>> >>>>> Cheers,
>>>> >>>>> Rafa Haro
>>>> >>>>>
>>>> >>>>> On 30/01/14 10:33, Cristian Petroaca wrote:
>>>> >>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> One of the necessary steps for implementing the Event extraction Engine
>>>> >>>>>> feature: https://issues.apache.org/jira/browse/STANBOL-1121 is to have
>>>> >>>>>> coreference resolution in the given text. This is provided now via the
>>>> >>>>>> stanford-nlp project but as far as I saw this module is performing mostly
>>>> >>>>>> pronominal (He, She) or nominal (Barack Obama and Mr. Obama) coreference
>>>> >>>>>> resolution.
>>>> >>>>>>
>>>> >>>>>> In order to get more coreferences from the text I thought of creating some
>>>> >>>>>> logic that would detect this kind of coreference:
>>>> >>>>>> "Apple reaches new profit heights. The software company just announced its
>>>> >>>>>> 2013 earnings."
>>>> >>>>>> Here "The software company" obviously refers to "Apple".
>>>> >>>>>> So I'd like to detect coreferences of Named Entities which are of the
>>>> >>>>>> rdf:type of the Named Entity, in this case "company", and also have
>>>> >>>>>> attributes which can be found in the dbpedia categories of the named
>>>> >>>>>> entity, in this case "software".
>>>> >>>>>>
>>>> >>>>>> The detection of coreferences such as "The software company" in the text
>>>> >>>>>> would also be done by either using the new Pos Tag Based Phrase extraction
>>>> >>>>>> Engine (noun phrases) or by using a dependency tree of the sentence and
>>>> >>>>>> picking up only subjects or objects.
>>>> >>>>>>
>>>> >>>>>> At this point I'd like to know if this kind of logic would be useful as a
>>>> >>>>>> separate Enhancement Engine (in case the precision and recall are good
>>>> >>>>>> enough) in Stanbol?
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Cristian
>>>> >>>>>>
>>>>
>>>> --
>>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>>
>>>
>>

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
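
For readers following the thread, the four-step heuristic quoted above (candidate noun phrases, category matching after lemmatization, rdf:type check, closest preceding entity) might look roughly like the plain-Java sketch below. Everything in it is an assumption made for illustration: the class CorefSketch, its methods, the naive plural-stripping "lemmatizer" and the 0.5/0.5 weighting are hypothetical and not part of Stanbol, and the input is assumed to be already POS-tagged (determiner and nouns given separately).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the heuristic from this thread; not Stanbol code. */
class CorefSketch {
    // Step 1a: determiners that refer to an entity mentioned inside the text.
    static final Set<String> LOCAL_DETERMINERS =
            new HashSet<>(Arrays.asList("the", "this", "these"));

    /** A named entity: offset in the text, rdf:type label, category labels. */
    static final class Entity {
        final String name;
        final int position;
        final String type;
        final List<String> categories;

        Entity(String name, int position, String type, String... categories) {
            this.name = name;
            this.position = position;
            this.type = type;
            this.categories = Arrays.asList(categories);
        }
    }

    /** Naive stand-in for a lemmatizer: it only strips common plural endings. */
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Step 1: a local determiner plus at least two nouns (modifier and head). */
    static boolean isCandidate(String determiner, List<String> nouns) {
        return LOCAL_DETERMINERS.contains(determiner.toLowerCase(Locale.ROOT))
                && nouns.size() >= 2;
    }

    /** Steps 2 and 3: category overlap (assumed weight 0.5) plus type match (0.5). */
    static double confidence(List<String> nouns, Entity entity) {
        if (nouns.isEmpty()) {
            return 0.0;
        }
        // Lemmas of all category words; a real version would keep only the nouns.
        Set<String> categoryLemmas = new HashSet<>();
        for (String category : entity.categories) {
            for (String word : category.split("[^A-Za-z0-9]+")) {
                categoryLemmas.add(lemma(word));
            }
        }
        int matched = 0;
        for (String noun : nouns) {
            if (categoryLemmas.contains(lemma(noun))) {
                matched++;
            }
        }
        double score = 0.5 * matched / nouns.size();
        String head = lemma(nouns.get(nouns.size() - 1)); // last noun is the head
        if (head.equals(lemma(entity.type))) {
            score += 0.5;
        }
        return score;
    }

    /** Step 4: best-scoring entity before the phrase; ties go to the closest one. */
    static Entity resolve(List<String> nouns, int phrasePosition, List<Entity> entities) {
        Entity best = null;
        double bestScore = 0.0;
        for (Entity e : entities) {
            if (e.position >= phrasePosition) {
                continue; // only entities mentioned before the noun phrase
            }
            double score = confidence(nouns, e);
            boolean closerTie = best != null && score == bestScore
                    && e.position > best.position;
            if (score > bestScore || closerTie) {
                best = e;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Entity microsoft = new Entity("Microsoft", 0, "Company",
                "Software companies of the United States",
                "Companies based in Redmond, Washington");
        List<String> nouns = Arrays.asList("software", "company");
        if (isCandidate("The", nouns)) {
            System.out.println(resolve(nouns, 36, Arrays.asList(microsoft)).name);
        }
    }
}
```

Inside an actual engine, the determiner and nouns would presumably come from the POS tags on the tokens of each noun phrase chunk, and the category and type labels from the entity hub, so only the scoring logic here carries over.
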