Hi Rupert,

Sorry for the late reply; I was out of the office for a project meeting over the past few days. We are certainly willing to contribute to the development of the engines, and I will work on the requested modifications to support the AnalysedText content part. We will also provide you with a mapping for the POS tag set and the other lexical features. I will check with the team responsible for the morphological analyzer about confidence levels or the ranking of multiple readings, as I am not sure about that.

The readings are missing for some lexical entries because those terms are not present in the lexicon of the morphological analyzer; they are "unknown" words, so to speak. This happens with misspelled words or unknown named entities. It is possible to explicitly set an "Unknown" POS lexical feature for them, if you wish, but the morphological analyzer itself returns no lexical features for them. Let me know if you want this update as well. Calling the named entity engine for Italian may be an alternative way of getting more information on those text fragments.
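
To make the "Unknown" option concrete, here is a minimal sketch of such a fallback: an entry for which the analyzer returned no reading gets a single placeholder reading. The types Reading and UnknownPosFallback are hypothetical stand-ins written for this mail, not classes of the CELI client or of Stanbol, and both the tag name "Unknown" and the use of the word form as lemma are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one morphological reading (not a CELI or Stanbol class).
final class Reading {
    final String lemma;
    final String pos;

    Reading(String lemma, String pos) {
        this.lemma = lemma;
        this.pos = pos;
    }
}

final class UnknownPosFallback {
    // Assumed tag name; the real value would be agreed on in the tag set mapping.
    static final String UNKNOWN_POS = "Unknown";

    // Returns the readings unchanged, or a single placeholder reading
    // (word form as lemma, POS "Unknown") when the analyzer found none.
    static List<Reading> withFallback(String wordForm, List<Reading> readings) {
        if (readings != null && !readings.isEmpty()) {
            return readings;
        }
        List<Reading> fallback = new ArrayList<>();
        fallback.add(new Reading(wordForm, UNKNOWN_POS));
        return fallback;
    }
}
```

With this, a word such as "Battistoni" that has no reading would carry exactly one reading with POS "Unknown", while entries with real readings pass through untouched.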

I will send you an update next week, as soon as I have finished integrating the changes.


Best
    Alessio

On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
Hi Alessio, all

I have started to work on the migration of the CELI Lemmatizer Engine
to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
Basically, the idea is to adapt the Lemmatizer Engine to use the
AnalysedText ContentPart (STANBOL-734) to store its results. The goal
of this work is to be able to use the word-level NLP analysis results
of CELI in Apache Stanbol (e.g. CELI POS tags and lemma information
for looking up terms with the KeywordLinkingEngine). Achieving this
would open up a lot of additional possibilities for Stanbol users who
want to use the CELI services.

While working on this I came across the following things:

(1) I noticed that the Lemmatizer Service does not provide
information for all words (LexicalEntry). As an example, in the
sentence

     Lo scandalo dei fondi pubblici sperperati in allegria dalla Regione
     Lazio ha dato i primi frutti: ieri il capogruppo Pdl Francesco Battistoni
     si è dimesso e la sede del Consiglio è stata invasa dalla Guardia
     di Finanza.

the LexicalEntries for "Pdl Francesco Battistoni si" do not have any
metadata (no <Reading>). Do you know why this is the case? Is there a
possibility to obtain LexicalFeatures for all words?

(2) The Stanbol NLP processing module maps the POS tag sets used by NLP
processing frameworks to the morphosyntactic categories defined by the
OLIA ontology [1]. The categories used are defined by the
LexicalCategory enumeration [2]. Actual POS tags are represented by the
PosTag class [3], which provides (1) the tag as a string and optionally
(2) the LexicalCategory. While LexicalCategories are optional, they are
important because they allow other components to determine the type of
a word in a language-independent way. Because of that, it would be
important to map the POS tag sets used by CELI to the
LexicalCategories used by the Stanbol NLP processing module. Can you
point me to documentation of the POS tag sets used by CELI for the
different languages?

The following code snippet shows what such a mapping could look like for Italian:

     // Tag set for the POS tags returned by the CELI service for Italian
     public static final TagSet<PosTag> ITALIAN =
             new TagSet<PosTag>("CELI Italian", "it");

     static {
         ITALIAN.addTag(new PosTag("ADJ", LexicalCategory.Adjective));
         ITALIAN.addTag(new PosTag("ADV", LexicalCategory.Adverb));
         ITALIAN.addTag(new PosTag("ART", LexicalCategory.PronounOrDeterminer));
         ITALIAN.addTag(new PosTag("CLI")); // mapping ??
         ITALIAN.addTag(new PosTag("CONJ", LexicalCategory.Conjuction));
         ITALIAN.addTag(new PosTag("PREP", LexicalCategory.Adposition));
         ITALIAN.addTag(new PosTag("NF", LexicalCategory.Noun));
         ITALIAN.addTag(new PosTag("NM", LexicalCategory.Noun));
         ITALIAN.addTag(new PosTag("V", LexicalCategory.Verb));
         // register the tag set so that it can be looked up by language
         getInstance().add(ITALIAN);
     }

BTW, I would also be interested in mappings of the other
LexicalFeatures extracted by CELI to the OLIA ontology (e.g. GENDER ->
olia:GenderFeature, NUMBER -> olia:NumberFeature, VERB_TENSE ->
olia:TenseFeature, ...).
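
As a rough illustration of the feature mapping I have in mind, the three examples above could be kept in a simple lookup table. This is only a sketch: the class name CeliFeatureMapping is made up for this mail, and the namespace (the ontology URL from [1] plus '#') is an assumption that should be checked against the ontology itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Stanbol or the CELI client.
final class CeliFeatureMapping {
    // Assumed namespace: the ontology URL from [1] followed by '#'.
    static final String OLIA_NS = "http://purl.org/olia/olia.owl#";

    // CELI feature name -> OLIA feature URI (only the examples named above).
    static final Map<String, String> FEATURE_TO_OLIA = new LinkedHashMap<>();

    static {
        FEATURE_TO_OLIA.put("GENDER", OLIA_NS + "GenderFeature");
        FEATURE_TO_OLIA.put("NUMBER", OLIA_NS + "NumberFeature");
        FEATURE_TO_OLIA.put("VERB_TENSE", OLIA_NS + "TenseFeature");
    }

    // Returns the OLIA URI for a CELI feature, or null if it is not mapped yet.
    static String oliaUri(String celiFeature) {
        return FEATURE_TO_OLIA.get(celiFeature);
    }
}
```

Features without an entry simply return null, so unmapped features can still be passed on by name only.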

(3) The Lemmatizer Engine does not provide confidence values
(probabilities) for the extracted features. If that information
exists, it would be great to have it available. Otherwise, can I assume
that the things mentioned first in the XML file have a higher
probability than the additional options (e.g. a <LexicalEntry> with
multiple <Reading> elements)?
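
To show what I would do with that assumption: if the order of the <Reading> elements really does reflect their likelihood (this still needs to be confirmed), a rank-based pseudo-confidence could be derived as sketched below. The class name and the 1/(rank+1) weighting are purely illustrative choices, not something Stanbol or CELI defines.

```java
// Hypothetical helper: derives pseudo-confidences from the order of readings.
final class ReadingRankConfidence {

    // Returns one value per reading, in document order. The reading at
    // zero-based position i gets weight 1/(i+1); the weights are then
    // normalized so that they sum to 1.
    static double[] confidences(int readingCount) {
        double[] weights = new double[readingCount];
        double sum = 0.0;
        for (int i = 0; i < readingCount; i++) {
            weights[i] = 1.0 / (i + 1);
            sum += weights[i];
        }
        for (int i = 0; i < readingCount; i++) {
            weights[i] /= sum;
        }
        return weights;
    }
}
```

A single reading gets the value 1.0, and with two readings the first gets 2/3 and the second 1/3. Real probabilities from the analyzer would of course be preferable.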

The code related to STANBOL-733 is developed in the
"stanbol-nlp-processing" branch

     svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/

best
Rupert Westenthaler



[1] http://purl.org/olia/olia.owl
[2] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/LexicalCategory.java
[3] http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/PosTag.java


--
*************************************
Alessio Bosca, Ph.D.
CELI s.r.l.
Via San Quintino 31
10121 Torino
Tel. +39 011.562.71.15
Fax +39 011.506.40.86
http://www.celi.it
*************************************

