Re: Update CELI engines to use Stanbol NLP processing

Rupert Westenthaler Fri, 21 Sep 2012 06:31:10 -0700

Hi,

I forgot to include the dev list in my last response to Alessio, hence the forward.


On Fri, Sep 21, 2012 at 3:22 PM, Rupert Westenthaler
<[email protected]> wrote:
> Hi Alessio,
>
> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca <[email protected]> wrote:
>> We are surely willing to contribute to the development of the engines and I
>> will work on the requested modifications for supporting the  AnalyzedText
>> content part.
>
> That's cool to hear. I have already started on some things. I will commit those
> later today so that you can continue from there.
>
>> We will also provide you a mapping for the POS tagset and the other lexical
>> features.
>
> If documentation of the POS tag sets is available, it would
> be cool if you could link it. When I commit my local changes there
> will be a "PosTagSetRegistry" in
> "org.apache.stanbol.enhancer.engines.celi" where you can add the
> mappings.
>
>>I will check with the team responsible for the morphological
>> analyzer about the confidence level or the ranking of multiple readings as
>> I'm not sure about that.
>>
>> Concerning the missing readings for some lexical entries: the
>> unrecognized terms are not present in the lexicon of the morphological
>> analyzer; they are "unknown" words, so to say.
>> This happens with misspelled words or unknown named entities. It is possible to
>> explicitly set a POS "Unknown" lexical feature for them, if you wish, but
>> no lexical features are retrieved by the morphological analyzer itself.
>> Let me know if you want this update as well.
>> Calling the named entities engine for Italian may be an alternative way of
>> getting more information on those textual fragments.
>>
>
> OK, that explains a lot. I had the impression that there is first a POS
> tagger and then a morphological analyzer that uses those results to provide
> the lemmas and other information. If the morphological analyzer adds
> possible lemmas based on words, I would expect that there are no
> results for some words and also that there are multiple readings for
> others.
>
> Does linguagrid also have a POS tagging service?
>
>> I will send you an update next week as soon as I have finished integrating the
>> updates
>>
>
> I am in Leipzig next week, so I might not be as responsive as usual.
>
> best
> Rupert
>
>>
>> Bests
>>     Alessio
>>
>>
>> On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
>>>
>>> Hi Alessio, all
>>>
>>> I have started to work on the migration of the CELI lemmatizer Engine
>>> to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
>>> Basically, the idea was to adapt the Lemmatizer Engine to use the
>>> AnalysedText ContentPart (STANBOL-734) to store its results. The goal
>>> of this work is to be able to use the word-level NLP analysis results of
>>> CELI in Apache Stanbol (e.g. CELI POS tags and lemma information for
>>> looking up terms with the KeywordLinkingEngine). Achieving this would
>>> open up a lot of additional possibilities for Stanbol users who want
>>> to use the CELI services.
>>>
>>> While working on this I came across the following things:
>>>
>>> (1) I noticed that the Lemmatizer Service does not provide
>>> information for all words (LexicalEntry). As an example, in the
>>> sentence
>>>
>>>      Lo scandalo dei fondi pubblici sperperati in allegria dalla Regione
>>>      Lazio ha dato i primi frutti: ieri il capogruppo Pdl Francesco Battistoni
>>>      si è dimesso e la sede del Consiglio è stata invasa dalla Guardia di Finanza.
>>>
>>> the LexicalEntries for "Pdl Francesco Battistoni si" do not have any
>>> metadata (no <Reading>). Do you know why this is the case? Is there a
>>> possibility to obtain LexicalFeatures for all words?
>>>
>>> (2) The Stanbol NLP processing module maps POS tag sets used by NLP
>>> processing frameworks to morphosyntactic categories defined by the
>>> OLIA ontology [1]. The categories used are defined by the LexicalCategory
>>> enumeration [2]. Actual POS tags are represented by the PosTag class
>>> [3], which provides (1) the tag as a string and optionally (2) the
>>> LexicalCategory. While LexicalCategories are optional, they are
>>> important because they allow other components to determine the type of a
>>> word in a language-independent way. Because of that, it would be
>>> important to map the POS tag sets used by CELI to the
>>> LexicalCategories used by the Stanbol NLP processing module. Can you
>>> point me to documentation of the POS tag sets used by CELI for the
>>> different languages?
>>>
>>> The following code snippet shows what such a mapping could look like for
>>> Italian:
>>>
>>>      public static final TagSet<PosTag> ITALIAN =
>>>          new TagSet<PosTag>("CELI Italian","it");
>>>
>>>      static {
>>>          ITALIAN.addTag(new PosTag("ADJ",LexicalCategory.Adjective));
>>>          ITALIAN.addTag(new PosTag("ADV",LexicalCategory.Adverb));
>>>          ITALIAN.addTag(new PosTag("ART",LexicalCategory.PronounOrDeterminer));
>>>          ITALIAN.addTag(new PosTag("CLI")); //mapping ??
>>>          ITALIAN.addTag(new PosTag("CONJ",LexicalCategory.Conjuction));
>>>          ITALIAN.addTag(new PosTag("PREP",LexicalCategory.Adposition));
>>>          ITALIAN.addTag(new PosTag("NF",LexicalCategory.Noun));
>>>          ITALIAN.addTag(new PosTag("NM",LexicalCategory.Noun));
>>>          ITALIAN.addTag(new PosTag("V",LexicalCategory.Verb));
>>>          getInstance().add(ITALIAN);
>>>      }
>>>
>>> BTW, I would also be interested in mappings of the other
>>> LexicalFeatures extracted by CELI to the OLIA ontology (e.g. GENDER ->
>>> olia:GenderFeature, NUMBER -> olia:NumberFeature, VERB_TENSE ->
>>> olia:TenseFeature, ...).
>>>
>>> (3) The Lemmatizer Engine does not provide confidence values (probabilities)
>>> for the extracted features. If that information is available, it
>>> would be great to have it exposed. Otherwise, can I assume that the
>>> things mentioned first in the XML file have a higher probability than
>>> additional options (e.g. <LexicalEntry> with multiple <Reading>)?
>>>
>>> The code related to STANBOL-733 is developed in the
>>> "stanbol-nlp-processing" branch
>>>
>>>      svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/
>>>
>>> best
>>> Rupert Westenthaler
>>>
>>>
>>>
>>> [1] http://purl.org/olia/olia.owl
>>> [2]
>>> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/LexicalCategory.java
>>> [3]
>>> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/PosTag.java
>>
>>
>>
>> --
>> *************************************
>> Alessio Bosca, Ph.D.
>> CELI s.r.l.
>> Via San Quintino 31
>> 10121 Torino
>> Tel. +39 011.562.71.15
>> Fax +39 011.506.40.86
>> http://www.celi.it
>> *************************************
>>
>>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

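For anyone following the thread without the branch checked out, here is a minimal, self-contained sketch of what the tag mapping in (2) amounts to, including the case of words that come back without any reading. It deliberately does not use the Stanbol classes: CeliTagMapping, its nested Category enum and categoryOf are hypothetical stand-ins, and the tag strings are simply the ones from the snippet quoted above, not a confirmed CELI tag set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a tag-set mapping; not the Stanbol TagSet/PosTag API. */
class CeliTagMapping {

    /** Stand-in categories; constant names are copied from the quoted snippet. */
    enum Category { Adjective, Adverb, PronounOrDeterminer, Conjuction, Adposition, Noun, Verb }

    private final Map<String, Category> byTag = new HashMap<>();

    CeliTagMapping() {
        byTag.put("ADJ", Category.Adjective);
        byTag.put("ADV", Category.Adverb);
        byTag.put("ART", Category.PronounOrDeterminer);
        byTag.put("CONJ", Category.Conjuction);
        byTag.put("PREP", Category.Adposition);
        byTag.put("NF", Category.Noun);
        byTag.put("NM", Category.Noun);
        byTag.put("V", Category.Verb);
        // "CLI" is intentionally left out: its category is still an open question.
    }

    /**
     * Looks up the category for a tag. The result is empty for tags without a
     * mapping and for a null tag (a word for which no reading was returned).
     */
    Optional<Category> categoryOf(String tag) {
        return Optional.ofNullable(byTag.get(tag));
    }

    public static void main(String[] args) {
        CeliTagMapping mapping = new CeliTagMapping();
        for (String tag : new String[] {"NF", "CLI", null}) {
            System.out.println(tag + " -> " + mapping.categoryOf(tag));
        }
    }
}
```

Returning an empty Optional for "CLI" and for missing tags keeps the category optional, in line with the description in (2).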

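Regarding (1) and (3), a rough sketch of how a consumer could cope with entries that have no reading and, only if the order in the response turns out to be meaningful, prefer the first reading. The XML layout is an assumption: the thread only names the <LexicalEntry> and <Reading> elements, so the <Result> root element and the lemma as text content of <Reading> are made up for illustration, as is the class ReadingOrderSketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper; the element layout it expects is assumed, not documented. */
class ReadingOrderSketch {

    /**
     * Returns one item per LexicalEntry, in document order: the text of its
     * first Reading, or null if the entry has no Reading at all.
     */
    static List<String> firstReadings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList entries = doc.getElementsByTagName("LexicalEntry");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                // getElementsByTagName returns matches in document order,
                // so item(0) is the reading listed first in the response.
                NodeList readings = entry.getElementsByTagName("Reading");
                result.add(readings.getLength() == 0
                        ? null
                        : readings.item(0).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Result>"
                + "<LexicalEntry><Reading>fondo</Reading><Reading>fondere</Reading></LexicalEntry>"
                + "<LexicalEntry/>"
                + "</Result>";
        System.out.println(firstReadings(xml));
    }
}
```

Whether "first" really means "most probable" is exactly the open question in (3), so this only shows where such a rule would plug in.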

-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
