Hi Joseph,

you are right: the 'Upper Case Token Mode' interferes with the configured UpperCase mode. Maybe it would be better to remove the 'Upper Case Token Mode' parameter introduced by STANBOL-1049 and implement similar functionality using the existing "Upper Case" parameter. But I am not yet completely sure if this is possible. In any case I will link your previous mail with this issue and note this as an unresolved issue for STANBOL-1049.

I think in your specific case it would be best to use a very low probability setting (e.g. prop=0.001), as it seems that a lot of the suggestions of Talismane are OK even if they have a very low probability. This would prevent the "unknown POS tag fallback" from taking effect and would therefore work around the described issues.

In addition you should consider activating case sensitive matching. This would also ensure that 'La' in the text is NOT matched with 'LA' in the controlled vocabulary.

Let me also add something about Upper Case and sentence start.

On Mon, Jun 3, 2013 at 11:50 AM, Joseph M'Bimbi-Bene <jbi...@object-ive.com> wrote:
> Here is the configuration of my linking engine:
>
> *;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
>
> Since I didn't want to have determiner to be linkable when they are
> uppercased at the beginning of a sentence, i explicitely specified
> uppercase tokens to not be treated specifically.

Upper Case Tokens at the beginning of sentences or sub-sentences (e.g. at the beginning of a quote) are ignored. So a 'La' at the beginning of a sentence MUST NOT be considered an upper case token. If you see 'La' being linked at a sentence start, then this would indicate that the sentence detection does not work properly. Can you send the text sample you used, so that I can check why Talismane fails to correctly split the sentences?

best
Rupert

> Here are some log excerpts:
>
> On token 'La', which is (i think) a determiner, anyway, definitely not a
> Noun :
>
> ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
> ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> upperCase=true]
>
> EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
> matchable=true | chunk: none
>

Here it says that 'La' is the 15th token of the sentence, so it is not at a sentence start. Its POS tag probability is also below the configured threshold, so the "unknown POS tag fallback" applies and the upper case token is marked as linkable.
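
To make the interplay between the probability threshold, the unknown POS tag fallback and the sentence start rule easier to follow, here is a minimal sketch in Python. It is a simplified model of the behaviour described in this mail, not the Stanbol implementation: the names `Token`, `linkable_pos` and `is_linkable` are made up for illustration, the POS tags are reduced to coarse category names, and only a single threshold (`prop`) is modelled while `pprob` is ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a token as it appears in the log excerpts."""
    text: str
    pos: str                      # coarse category, e.g. "Noun", "Adjective"
    prob: float                   # probability reported by the POS tagger
    sentence_start: bool = False  # True for the first token of a (sub-)sentence


def linkable_pos(token: Token, categories: FrozenSet[str],
                 min_prob: float) -> Optional[bool]:
    """Return True/False if the POS tag is trusted, None if it is 'unknown'."""
    if token.prob < min_prob:
        return None  # tag too uncertain: corresponds to linkabkePos=null in the log
    return token.pos in categories


def is_linkable(token: Token,
                categories: FrozenSet[str] = frozenset({"Noun"}),
                min_prob: float = 0.55) -> bool:
    decision = linkable_pos(token, categories, min_prob)
    if decision is not None:
        return decision
    # Unknown POS tag fallback (assumed rule): link only upper case tokens,
    # and never treat a token at a sentence start as upper case.
    return token.text[:1].isupper() and not token.sentence_start


if __name__ == "__main__":
    la = Token("La", "Adjective", 0.0169)  # 15th token, not a sentence start
    print(is_linkable(la, min_prob=0.55))   # fallback applies
    print(is_linkable(la, min_prob=0.001))  # tag trusted, Adjective is not linkable
```

In this model a very low `min_prob` means the tagger's suggestion is used even when it is uncertain, so the fallback never decides, which is the reason for suggesting prop=0.001 above.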

> EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true
>
> EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
>
> EntityLinker >> searchStrings [La, recherche]
>
> EntityLinker - found 1 entities ...
>
> EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> for language null
>
> MainLabelTokenizer - tokenized la -> [la]
>
> EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> EntityLinker >> Suggestions:
> EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1]
> score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking:
> null
>
>
> Then i went to the page of the jira issue 1049 and i guessed my token
> corresponded to "unknown POS tag rule".
> "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
> this have anything to do with the the *Upper Case Token Mode* parameter ?
>
> Since my tokens 'La' are always at the beginning of the sentence, i guessed
> they falled in the category:
> "else - lower case token or sentence or sub-sentence start
> * tokens equals or longer as
> TextProcessingConfig#minSearchTokenLength are marked as matchable"
>
> I don't understand that rule: is that supposed to override the *Upper Case
> Token Mode* parameter ? Anyway i tried with all 'La' lowercased, ie to 'la'
> and the tokens 'la are never processed. Here is the log excerpt:
>
> ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
> DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
> chunk: 'none'
>
> ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
> matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
> upperCase=false]
>
>
> After i few minutes of reflexion, i see that linkabkePos and matchablePos are
> no longer equals to "null". What is the rule to set them to null or not. It
> is strange that just an uppercase can change the POS tag of the token that
> drastically for Talismane but i cannot do anything about it. I still have
> the interrogation about the supposed overriding of the *Upper Case Token
> Mode* parameter for "unknown POS tag rule".
>
>
>
> On a quite related topic, the *Upper Case Token Mode* parameter doesn't
> seem to behave properly (or i missed something). i let "uc=NONE" in the
> config of the engine and monitored the processing of the token, here are
> the logs. On the token "utilisée" for the text: "AE est une mesure
> couramment utilisée."
>
> ProcessingState > 5: Token: [543, 551] utilisée (pos:[Value [pos:
> VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
> ProcessingState - TokenData:
> 'utilisée'[linkable=false(linkabkePos=false)|
> matchable=false(matchablePos=false)| alpha=true| seachLength=true|
> upperCase=false]
>
> token is not processed, which i am fine with since its POS tag is VPP
>
>
> Now On the token "Utilisée" for the text: "AE est une mesure couramment
> Utilisée."
> ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*
> (olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
> ProcessingState - TokenData:
> 'Utilisée'[linkable=true(linkabkePos=null)|
> matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> upperCase=true]
>
> so the POS tag is OK, but the prob doesn't reach the threshold (which i set
> to 0.55), here is the log of the processing of the token
>
> EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true,
> matchable=true | chunk: none
> EntityLinker - 4:'couramment' (lemma: null) linkable=false,
> matchable=false
> EntityLinker - 6:'.' (lemma: null) linkable=false, matchable=false
> EntityLinker + 3:'mesure' (lemma: null) linkable=true, matchable=true
> EntityLinker >> searchStrings [mesure, Utilisée]
>
> is it a problem of processing of POS tagging, of UpperCase linking or did i
> misunderstood something.
>
> Thank you for the time you spend helping us users, it is very appreciated.
> best regard, Joseph
>
> 2013/6/3 Rupert Westenthaler <rupert.westentha...@gmail.com>
>
>> Hi Joseph
>>
>> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
>> <jbi...@object-ive.com> wrote:
>> > I think it is the tokenizing process of Talismane NLP, since my
>> > enhancement chain is :
>> > -langdetect
>> > -talismaneNLP
>> > -MyVocabulary
>> >
>>
>> I also used Talismane when testing and I was not seeing tokens like that
>>
>> Here are an excerpt of my log (with minSearchTokenLength set to 2)
>>
>> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
>> | chunk: none
>> - 10:'*' (lemma: null) linkable=false, matchable=false
>> - 12:'*' (lemma: null) linkable=false, matchable=false
>> - 9:'indiquant' (lemma: null) linkable=false, matchable=false
>> - 13:'une' (lemma: null) linkable=false, matchable=false
>> - 8:')' (lemma: null) linkable=false, matchable=false
>> + 14:'servitude' (lemma: null) linkable=false, matchable=true
>> >> searchStrings [AE, servitude]
>>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen