Re: Problem with entityLinking on Uppercase tokens

Rupert Westenthaler Mon, 03 Jun 2013 05:25:12 -0700

Hi Joseph,

You are right: the 'Upper Case Token Mode' interferes with the
configured UpperCase mode. Maybe it would be better to remove the
'Upper Case Token Mode' parameter introduced by STANBOL-1049 and
implement similar functionality by using the existing "Upper Case"
parameter, but I am not yet completely sure whether this is possible. In
any case, I will link your previous mail to this issue and note this as
an unresolved issue for STANBOL-1049.


I think in your specific case it would be best to use a very low
probability setting (e.g. prop=0.001), as it seems that a lot of the
suggestions of Talismane are OK, even if they have a very low
probability. This would prevent the "unknown POS tag fallback" from
taking effect and therefore work around the described issues.
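
To illustrate why a lower threshold helps, here is a rough Python sketch
of the decision as described in this thread. It is not the actual Stanbol
implementation: the function names are made up for illustration, only the
lc=Noun and prop settings are modelled, and the fallback is reduced to
"upper case tokens are linked".

```python
from typing import Optional

# Mirrors lc=Noun from the quoted configuration (assumption: OLiA category names).
LINKABLE_CATEGORIES = {"Noun"}


def linkable_pos(categories: set, prob: float, min_prob: float) -> Optional[bool]:
    """True/False if the POS tag is trusted, None if it is below the threshold."""
    if prob < min_prob:
        return None  # too uncertain: handled like an unknown POS tag
    return bool(categories & LINKABLE_CATEGORIES)


def is_linkable(categories: set, prob: float, upper_case: bool, min_prob: float) -> bool:
    decision = linkable_pos(categories, prob, min_prob)
    if decision is None:
        return upper_case  # simplified "unknown POS tag fallback"
    return decision


# Values taken from the log excerpts quoted further down in this mail.
la = ({"Adjective"}, 0.016871281997002517, True)
print(is_linkable(*la, min_prob=0.55))   # True: fallback applies
print(is_linkable(*la, min_prob=0.001))  # False: the ADJ tag is used
```

With prop=0.55 the ADJ tag of 'La' is discarded and the upper case
fallback makes the token linkable; with prop=0.001 the tag is kept and,
as an adjective is not a noun, the token is not linked.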

In addition, you should consider activating case sensitive matching.
This would also ensure that 'La' in the text is NOT matched with 'LA'
in the controlled vocabulary.
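
As a minimal illustration of the difference (a sketch only; label_matches
is a hypothetical helper, not a Stanbol API):

```python
def label_matches(token: str, label: str, case_sensitive: bool) -> bool:
    """Compare a token from the text with a vocabulary label."""
    if case_sensitive:
        return token == label
    return token.casefold() == label.casefold()


print(label_matches("La", "LA", case_sensitive=False))  # True
print(label_matches("La", "LA", case_sensitive=True))   # False
```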

Let me also add something about Upper Case and sentence start.

On Mon, Jun 3, 2013 at 11:50 AM, Joseph M'Bimbi-Bene
<[email protected]> wrote:
> Here is the configuration of my linking engine:
>
> *;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
>
> Since I didn't want determiners to be linkable when they are
> uppercased at the beginning of a sentence, I explicitly specified
> that uppercase tokens should not be treated specially.

Upper case tokens at the beginning of sentences or sub-sentences (e.g.
at the beginning of a quote) are ignored. So a 'La' at the beginning of a
sentence MUST NOT be considered an upper case token. If you see
'La' being linked at a sentence start, this would indicate that
the sentence detection does not work properly.
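
Roughly, the rule can be sketched like this in Python (an illustration
only, not the Stanbol code; it assumes a 0-based token index within the
detected sentence and leaves out sub-sentence starts):

```python
def counts_as_upper_case(token: str, index_in_sentence: int) -> bool:
    """Upper case only counts when the token does not start the sentence."""
    starts_upper = token[:1].isupper()
    return starts_upper and index_in_sentence > 0


print(counts_as_upper_case("La", 0))   # False: sentence start is ignored
print(counts_as_upper_case("La", 15))  # True: token 15 is inside the sentence
```

If the sentence boundary before 'La' is missed, the token gets a high
index like in the second call and is treated as a regular upper case token.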

Can you send the text sample you used, so that I can check why
Talismane fails to correctly split the sentences?

best
Rupert

> Here are some log excerpts:
>
> On token 'La', which is (I think) a determiner and, in any case, definitely
> not a Noun:
>
> ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
> ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> upperCase=true]
>
> EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
> matchable=true | chunk: none
>

Here it says that 'La' is token 15 of the sentence, not its first token.
This is the reason why it is marked as linkable.


> EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true
>
> EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
>
> EntityLinker >> searchStrings [La, recherche]
>
> EntityLinker - found 1 entities ...
>
> EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> for language null
>
> MainLabelTokenizer - tokenized la -> [la]
>
> EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> EntityLinker >> Suggestions:EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1]
> score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking:
> null
>
>
> Then I went to the page of JIRA issue 1049 and guessed that my token
> corresponded to the "unknown POS tag rule".
> "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
> this have anything to do with the *Upper Case Token Mode* parameter?
>
> Since my tokens 'La' are always at the beginning of the sentence, I guessed
> they fell into the category:
> "else - lower case token or sentence or sub-sentence start
>         * tokens equals or longer as
> TextProcessingConfig#minSearchTokenLength are marked as matchable"
>
> I don't understand that rule: is it supposed to override the *Upper Case
> Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to 'la',
> and the tokens 'la' are never processed. Here is the log excerpt:
>
> ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
> DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
> chunk: 'none'
>
> ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
> matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
> upperCase=false]
>
>
> After a few minutes of reflection, I see that linkabkePos and matchablePos are
> no longer equal to "null". What is the rule for setting them to null or not? It
> is strange that just an uppercase letter can change the POS tag of the token
> that drastically for Talismane, but I cannot do anything about it. I still
> have the question about the supposed overriding of the *Upper Case Token
> Mode* parameter for the "unknown POS tag rule".
>
>
>
> On a closely related topic, the *Upper Case Token Mode* parameter doesn't
> seem to behave properly (or I missed something). I left "uc=NONE" in the
> config of the engine and monitored the processing of the token; here are
> the logs. On the token "utilisée" for the text: "AE est une mesure
> couramment utilisée."
>
> ProcessingState   > 5: Token: [543, 551] utilisée (pos:[Value [pos:
> VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
> ProcessingState     - TokenData:
> 'utilisée'[linkable=false(linkabkePos=false)|
> matchable=false(matchablePos=false)| alpha=true| seachLength=true|
> upperCase=false]
>
> The token is not processed, which I am fine with since its POS tag is VPP.
>
>
> Now on the token "Utilisée" for the text: "AE est une mesure couramment
> Utilisée."
> ProcessingState   > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*
> (olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
> ProcessingState     - TokenData:
> 'Utilisée'[linkable=true(linkabkePos=null)|
> matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> upperCase=true]
>
> So the POS tag is OK, but the prob doesn't reach the threshold (which I set
> to 0.55). Here is the log of the processing of the token:
>
> EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true,
> matchable=true | chunk: none
> EntityLinker     - 4:'couramment' (lemma: null) linkable=false,
> matchable=false
> EntityLinker     - 6:'.' (lemma: null) linkable=false, matchable=false
> EntityLinker     + 3:'mesure' (lemma: null) linkable=true, matchable=true
> EntityLinker   >> searchStrings [mesure, Utilisée]
>
> Is it a problem with POS tagging, with UpperCase linking, or did I
> misunderstand something?
>
> Thank you for the time you spend helping us users; it is much appreciated.
> Best regards, Joseph
>
> 2013/6/3 Rupert Westenthaler <[email protected]>
>
>> Hi Joseph
>>
>> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
>> <[email protected]> wrote:
>> > I think it is the tokenizing process of Talismane NLP, since my
>> > enhancement chain is:
>> > -langdetect
>> > -talismaneNLP
>> > -MyVocabulary
>> >
>>
>> I also used Talismane when testing and I was not seeing tokens like that.
>>
>> Here is an excerpt of my log (with minSearchTokenLength set to 2):
>>
>> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
>> | chunk: none
>>     - 10:'*' (lemma: null) linkable=false, matchable=false
>>     - 12:'*' (lemma: null) linkable=false, matchable=false
>>     - 9:'indiquant' (lemma: null) linkable=false, matchable=false
>>     - 13:'une' (lemma: null) linkable=false, matchable=false
>>     - 8:')' (lemma: null) linkable=false, matchable=false
>>     + 14:'servitude' (lemma: null) linkable=false, matchable=true
>>   >> searchStrings [AE, servitude]
>>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
