Hi, I forgot to include the dev list in my last response to Alessio, hence the forward.

On Fri, Sep 21, 2012 at 3:22 PM, Rupert Westenthaler <[email protected]> wrote:
> Hi Alessio,
>
> On Fri, Sep 21, 2012 at 12:51 PM, Alessio Bosca <[email protected]> wrote:
>> We are surely willing to contribute to the development of the engines and I
>> will work on the requested modifications for supporting the AnalyzedText
>> content part.
>
> That's cool to hear. I have already started on some things. I will commit those
> later today so that you can continue from there.
>
>> We will also provide you a mapping for the POS tagset and the other lexical
>> features.
>
> If documentation of the POS tag sets is available, it would
> be cool if you could link it. When I commit my local changes there
> will be a "PosTagSetRegistry" in
> "org.apache.stanbol.enhancer.engines.celi" where you can add the
> mappings.
>
>> I will check with the team responsible for the morphological
>> analyzer about the confidence level or the ranking of multiple readings as
>> I'm not sure about that.
>>
>> Concerning the missing readings for some lexical entries, it is because the
>> unrecognized terms are not present in the lexicon of the morphological
>> analyzer; they are "unknown" words, so to say.
>> It happens with misspelled words or unknown named entities. It is possible to
>> explicitly set a POS "Unknown" lexical feature for them, if you wish so, but
>> there are no lexical features retrieved by the morphological analyzer itself.
>> Let me know if you want this update as well.
>> Calling the named entities engine for Italian may be an alternative way of
>> getting more info on those textual fragments.
>>
>
> OK, that explains a lot. I had the impression that there is first a POS
> tagger and then a morphological analyzer that uses those results to provide
> the lemmas and other information. If the morphological analyzer adds
> possible lemmas based on words, I would expect that there are no
> results for some words and also that there are multiple readings for
> others.
>
> Does linguagrid also have a POS tagging service?
>
>> I will send you an update next week as soon as I have finished integrating the
>> updates
>>
>
> I am in Leipzig next week, so I might not be as responsive as usual.
>
> best
> Rupert
>
>>
>> Bests
>> Alessio
>>
>>
>> On 09/21/2012 09:16 AM, Rupert Westenthaler wrote:
>>>
>>> Hi Alessio, all
>>>
>>> I have started to work on the migration of the CELI lemmatizer Engine
>>> to the new Stanbol NLP processing module (STANBOL-733, STANBOL-738).
>>> Basically the idea was to adapt the Lemmatizer Engine to use the
>>> AnalysedText ContentPart (STANBOL-734) to store its results. The goal
>>> of this work is to be able to use the word-level NLP analysis results of
>>> CELI in Apache Stanbol (e.g. CELI POS tags and lemma information for
>>> looking up terms with the KeywordLinkingEngine). Achieving this would
>>> open up a lot of additional possibilities for Stanbol users who want
>>> to use the CELI services.
>>>
>>> While working on this I came across the following things:
>>>
>>> (1) I noticed that the Lemmatizer Service does not provide
>>> information for all words (LexicalEntry). As an example, in the
>>> sentence
>>>
>>>     Lo scandalo dei fondi pubblici sperperati in allegria dalla Regione
>>>     Lazio ha dato i primi frutti: ieri il capogruppo Pdl Francesco
>>>     Battistoni
>>>     si è dimesso e la sede del Consiglio è stata invasa dalla Guardia
>>>     di Finanza.
>>>
>>> the LexicalEntries for "Pdl Francesco Battistoni si" do not have any
>>> metadata (no <Reading>). Do you know why this is the case? Is there a
>>> possibility to obtain LexicalFeatures for all words?
>>>
>>> (2) The Stanbol NLP processing module maps POS tag sets used by NLP
>>> processing frameworks to morphosyntactic categories defined by the
>>> OLIA ontology [1]. The categories used are defined by the LexicalCategory
>>> enumeration [2].
>>> Actual POS tags are represented by the PosTag class
>>> [3], which provides (1) the tag as a string and optionally (2) the
>>> LexicalCategory. While LexicalCategories are optional, they are
>>> important as they allow other components to determine the type of a
>>> word in a language-independent way. Because of that it would be
>>> important to map the POS tag sets used by CELI to the
>>> LexicalCategories used by the Stanbol NLP processing module. Can you
>>> point me to documentation of the POS tag sets used by CELI for the
>>> different languages?
>>>
>>> The following code snippet shows what such a mapping could look like for
>>> Italian:
>>>
>>>     public static final TagSet<PosTag> ITALIEN =
>>>         new TagSet<PosTag>("CELI Italian","it");
>>>
>>>     static {
>>>         ITALIEN.addTag(new PosTag("ADJ",LexicalCategory.Adjective));
>>>         ITALIEN.addTag(new PosTag("ADV",LexicalCategory.Adverb));
>>>         ITALIEN.addTag(new PosTag("ART",LexicalCategory.PronounOrDeterminer));
>>>         ITALIEN.addTag(new PosTag("CLI")); //mapping ??
>>>         ITALIEN.addTag(new PosTag("CONJ",LexicalCategory.Conjuction));
>>>         ITALIEN.addTag(new PosTag("PREP",LexicalCategory.Adposition));
>>>         ITALIEN.addTag(new PosTag("NF",LexicalCategory.Noun));
>>>         ITALIEN.addTag(new PosTag("NM",LexicalCategory.Noun));
>>>         ITALIEN.addTag(new PosTag("V",LexicalCategory.Verb));
>>>         getInstance().add(ITALIEN);
>>>     }
>>>
>>> BTW I would also be interested in mappings of the other
>>> LexicalFeatures extracted by CELI to the OLIA ontology (e.g. GENDER ->
>>> olia:GenderFeature, NUMBER -> olia:NumberFeature, VERB_TENSE ->
>>> olia:TenseFeature, ...).
>>>
>>> (3) The Lemmatizer Engine does not provide confidence values (probabilities)
>>> for the extracted features. If that information is available, it
>>> would be great to have it. Otherwise, can I assume that the
>>> things mentioned first in the XML file have a higher probability than
>>> additional options (e.g. <LexicalEntry> with multiple <Reading>)?
>>>
>>> The code related to STANBOL-733 is developed in the
>>> "stanbol-nlp-processing" branch
>>>
>>>     svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/
>>>
>>> best
>>> Rupert Westenthaler
>>>
>>>
>>>
>>> [1] http://purl.org/olia/olia.owl
>>> [2]
>>> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/LexicalCategory.java
>>> [3]
>>> http://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing/enhancer/generic/nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos/PosTag.java
>>
>>
>>
>> --
>> *************************************
>> Alessio Bosca, Ph.D.
>> CELI s.r.l.
>> Via San Quintino 31
>> 10121 Torino
>> Tel. +39 011.562.71.15
>> Fax +39 011.506.40.86
>> http://www.celi.it
>> *************************************
>>
>>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
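
PS: The tag-to-category mapping discussed under point (2) can also be sketched without any Stanbol dependency, which may help when checking a tag list before it goes into the registry. This is only a rough illustration under stated assumptions: the class CeliPosMapping, its Category enum and the categoryOf method are hypothetical names made up for this example and are not the TagSet, PosTag or LexicalCategory API. Only the tag strings are taken from the thread. "CLI" is left unmapped because its mapping is still an open question, and tags of unknown words simply yield no category.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Standalone sketch; none of these types are the actual Stanbol classes. */
class CeliPosMapping {

    /** Hypothetical stand-in for a language-independent category set. */
    enum Category {
        ADJECTIVE, ADVERB, PRONOUN_OR_DETERMINER, CONJUNCTION, ADPOSITION, NOUN, VERB
    }

    /** CELI tag strings for Italian as listed in the thread. */
    private static final Map<String, Category> ITALIAN = new LinkedHashMap<>();

    static {
        ITALIAN.put("ADJ", Category.ADJECTIVE);
        ITALIAN.put("ADV", Category.ADVERB);
        ITALIAN.put("ART", Category.PRONOUN_OR_DETERMINER);
        ITALIAN.put("CONJ", Category.CONJUNCTION);
        ITALIAN.put("PREP", Category.ADPOSITION);
        ITALIAN.put("NF", Category.NOUN);
        ITALIAN.put("NM", Category.NOUN);
        ITALIAN.put("V", Category.VERB);
        // "CLI" is deliberately not mapped: its category is an open question.
    }

    /** Returns the category for a tag, or empty for unmapped or unknown tags. */
    static Optional<Category> categoryOf(String tag) {
        if (tag == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(ITALIAN.get(tag));
    }

    public static void main(String[] args) {
        // One mapped tag, one known-but-unmapped tag, one unknown tag.
        for (String tag : new String[] {"NF", "CLI", "XYZ"}) {
            String category = categoryOf(tag).map(Category::name).orElse("(no category)");
            System.out.println(tag + " -> " + category);
        }
    }
}
```

Returning an empty Optional instead of a default category mirrors the point made above that the category is optional: a consumer can still use the raw tag string when no mapping exists.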
