Hi all,

FYI, Joseph provided a detailed report about his problem. A first look
indicates that this problem could potentially be a bug introduced with
STANBOL-1049 [1]; however, I have not yet had time to look into it, as I
was traveling for the last 7 days.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-1049

On Mon, May 6, 2013 at 10:48 AM, Joseph M'Bimbi-Bene <[email protected]> wrote:
> I thought it might be a bug in the absence of POS tagging, etc., so I used
> Talismane for the NLP tasks. I configured the EntityhubLinkingEngine to link
> adjectives, since that is what Talismane tags "mario" as, but it doesn't
> change anything. Here are the logs:
>
> .EntityLinker --- preocess Token 117: *moustachu* (lemma: none | pos:[Value [pos: ADJ(olia:Adjective)].prob=0.4520518431389538]) chunk: none
> .EntityLinker - 116:'*plombier*' (lemma: none | pos:[Value [pos: NC(olia:CommonNoun|olia:Noun)].prob=0.6784572817881412])
> .EntityLinker + 118:'supérieure' (lemma: none | pos:[Value [pos: ADJ(olia:Adjective)].prob=0.9366843193563169])
> .EntityLinker >> searchStrings *[moustachu, supérieure]*
> .EntityLinker - found 1 entities ...
> .EntityLinker >> http://example.org/resource/Mario (ranking: null)
> .MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
> .MainLabelTokenizer - tokenized le plombier moustachu -> *[le, plombier, moustachu]*
> .MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
> .MainLabelTokenizer - tokenized Mario -> [Mario]
> .EntityLinker - *no match*
>
> Why isn't "plombier" in the searchStrings? Even though I configured the
> engine so that adjectives are linkable tokens, "plombier" should, according
> to the documentation, be a "matchable token". The behavior of this engine is
> quite disturbing ...
>
>
> 2013/5/6 Joseph M'Bimbi-Bene <[email protected]>
>
>> Hello everybody, I'm having some problems with the EntityhubLinkingEngine.
>> Until about two weeks ago, I used it for NER tasks on a custom vocabulary
>> and it worked fine.
>> Now I cannot spot entities whose labels span several words (even with the
>> parameter lmmtip in "languages configuration"), and it now seems to be
>> case sensitive, even though it is configured not to be.
>>
>> Here is what my entity looks like:
>>
>> <rdf:Description rdf:about="http://example.org/resource#Mario">
>> <skos:prefLabel>Mario</skos:prefLabel>
>> <skos:altLabel>le plombier moustachu</skos:altLabel>
>> <rdf:type>http://example.org/concept#gentil</rdf:type>
>> <rdf:type>http://example.org/concept#humain</rdf:type>
>> </rdf:Description>
>>
>> I want to spot it with the mention "plombier moustachu".
>> Here is a log illustrating what I used to get:
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker --- preocess Token 825: plombier (lemma: none | pos:[]) chunk: none
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - 824:'le' (lemma: none | pos:[])
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker + 826:'moustachu' (lemma: none | pos:[])
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >> searchStrings [plombier, moustachu]
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - found 1 entities ...
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker > http://example.org/resource#Mario
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker < le plombier moustachu[m=FULL,s=3,c=3(1.0)/3] score=1.0[l=1.0,t=1.0] for http://example.org/resource#Mario
>>
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >> Suggestions:
>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - 0: le plombier moustachu[m=FULL,s=3,c=3(1.0)/3] score=1.0[l=1.0,t=1.0] for http://example.org/resource#Mario
>>
>> And here is what I get now, starting with the processing of the token
>> "plombier":
>>
>> EntityLinker --- *preocess Token 17: plombier* (lemma: none | pos:[]) chunk: none
>> EntityLinker - 16:'le' (lemma: none | pos:[])
>> EntityLinker - 18:*'moustachu'* (lemma: none | pos:[])
>> EntityLinker - 15:'sont' (lemma: none | pos:[])
>> EntityLinker - 19:'des' (lemma: none | pos:[])
>> EntityLinker - 14:',' (lemma: none | pos:[])
>> EntityLinker - 20:'collines' (lemma: none | pos:[])
>> EntityLinker >> *searchStrings [plombier]*
>> .MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
>> MainLabelTokenizer - tokenized le plombier moustachu -> *[le, plombier, moustachu]*
>> MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
>> MainLabelTokenizer - tokenized Mario -> [Mario]
>> EntityLinker - *no match*
>>
>> Why isn't "plombier" or "moustachu" in the searchStrings, just as before?
>>
>> And now the processing of "mario":
>>
>> .EntityLinker --- preocess Token 16: *mario* (lemma: none | pos:[]) chunk: none
>> .EntityLinker - 15:'sont' (lemma: none | pos:[])
>> .EntityLinker - 17:'des' (lemma: none | pos:[])
>> .EntityLinker - 14:',' (lemma: none | pos:[])
>> .EntityLinker - 18:'collines' (lemma: none | pos:[])
>> .EntityLinker - 13:'mendips' (lemma: none | pos:[])
>> .EntityLinker - 19:'situées' (lemma: none | pos:[])
>> .EntityLinker >> searchStrings *[mario]*
>> .EntityLinker - found 1 entities ...
>> .EntityLinker > http://example.org/resource/Mario (ranking: null)
>> .MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
>> .MainLabelTokenizer - tokenized le plombier moustachu -> [le, plombier, moustachu]
>> .MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
>> .MainLabelTokenizer - tokenized Mario -> *[Mario]*
>> .EntityLinker - *no match*
>>
>> Why isn't "mario" matched against "Mario"? I configured the engine so
>> that it is not case sensitive.
>>
>> As you can see, within the MaxTokenSearchDistance I still have the "le" and
>> "moustachu" tokens, but they don't go into the searchStrings for lookup.
>> The result of the enhancement is now pretty bad. What is going on?
>>
>> Thank you a lot in advance

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
