Re: Problem with entityLinking on Uppercase tokens

Joseph M'Bimbi-Bene Mon, 03 Jun 2013 06:46:01 -0700

Thank you for your quick answer.
Here is the text I used:

La recherche d'information (RI1) est le domaine qui étudie la manière de
retrouver des informations dans un corpus. Celui-ci est composé de
documents d'une ou plusieurs bases de données, qui sont décrits par un
contenu ou les métadonnées associées. Les bases de données peuvent être
relationnelles ou non structurées, telles celles mises en réseau par des
liens hypertexte comme dans le World Wide Web, l'internet et les intranets.
Le contenu des documents peut être du texte, des sons, ses images ou des
données. AE est une mesure couramment utilisée.


La recherche d'information est historiquement liée aux sciences de
l'information et à la bibliothéconomie qui visent à représenter des
documents dans le but d'en récupérer des informations, au moyen de la
construction d’index. L’informatique a permis le développement d’outils
pour traiter l’information et établir la représentation des documents au
moment de leur indexation, ainsi que pour rechercher l’information. La
recherche d'information est aujourd'hui un champ pluridisciplinaire,
intéressant même les sciences cognitives.

La recherche d'information sur le web à l'aide d'un moteur de recherche est
une technique de l'information et de la communication, désormais
massivement adoptée par les usagers.


Here is the RDF describing my entity:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.edf.fr/EdfAcronyme.owl#"
    xmlns:j.1="http://xmlns.com/foaf/0.1/"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:j.3="http://purl.org/dc/terms/"
    xmlns:j.2="http://stanbol.apache.org/ontology/entityhub/entityhub#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#LA.meta">
    <j.2:isChached rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</j.2:isChached>
    <j.1:primaryTopic rdf:resource="http://www.edf.fr/EdfAcronyme.owl#LA.meta"/>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
    <j.2:about rdf:resource="http://www.edf.fr/EdfAcronyme.owl#LA"/>
    <j.2:site rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EDFAcronyme</j.2:site>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#LA">
    <rdfs:label>LA</rdfs:label>
    <rdf:type rdf:resource="http://www.edf.fr/EdfAcronyme.owl#Acronyme"/>
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
    <j.1:name>LA</j.1:name>
    <j.3:description>License Application</j.3:description>
    <j.1:isPrimaryTopicOf rdf:resource="http://www.edf.fr/EdfAcronyme.owl#LA.meta"/>
  </rdf:Description>
</rdf:RDF>
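
To make the mismatch concrete: the vocabulary label is `LA`, while several sentences of the text start with the French article `La`. The following is a minimal Python sketch, not Stanbol code, of how the two strings compare under case-insensitive and case-sensitive matching. The function name `label_matches` is made up for this illustration, and how Stanbol compares labels internally is not shown here.

```python
def label_matches(token: str, label: str, case_sensitive: bool) -> bool:
    """Compare a text token with a vocabulary label (illustration only)."""
    if case_sensitive:
        return token == label
    # casefold() is Python's aggressive lowercasing for caseless comparison
    return token.casefold() == label.casefold()


token = "La"  # first word of "La recherche d'information ..."
label = "LA"  # rdfs:label of EdfAcronyme.owl#LA

print(label_matches(token, label, case_sensitive=False))  # True
print(label_matches(token, label, case_sensitive=True))   # False
```

With a case-insensitive comparison the article and the acronym are indistinguishable, which is consistent with the `LA` suggestion seen in the logs.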

2013/6/3 Rupert Westenthaler <rupert.westentha...@gmail.com>

> Hi Joseph,
>
> you are right the 'Upper Case Token Mode' interferes with the
> configured UpperCase mode. Maybe it would be better to remove the
> 'Upper Case Token Mode' parameter introduced by STANBOL-1049 and
> implement a similar functionality by using the existing "Upper Case"
> parameter. But I am not yet completely sure if this is possible. In any
> case I will link your previous mail with this issue and note this as an
> unresolved issue for STANBOL-1049.
>
> I think in your specific case it would be best to use a very low
> probability setting (e.g. prop=0.001) as it seems that a lot of the
> suggestions of Talismane are ok, even if they do have a very low
> probability. This would prevent the "unknown POS tag fallback" from taking
> effect and therefore work around the described issues.
>
> In addition you should consider activating case sensitive matching.
> This would also ensure that 'La' in the text is NOT matched with 'LA'
> in the controlled vocabulary.
>
> Let me also add something about Upper Case and sentence start.
>
> On Mon, Jun 3, 2013 at 11:50 AM, Joseph M'Bimbi-Bene
> <jbi...@object-ive.com> wrote:
> > Here is the configuration of my linking engine:
> >
> > *;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
> >
> > Since I didn't want determiners to be linkable when they are
> > uppercased at the beginning of a sentence, I explicitly specified
> > that uppercase tokens should not be treated specially.
>
> Upper Case Tokens at the beginning of sentences or sub-sentences (e.g.
> at the beginning of a quote) are ignored. So a 'La' at the beginning of a
> sentence MUST NOT be considered as an upper case token. So if you see
> 'La' being linked at a sentence start, then this would indicate that
> the sentence detection does not work properly.
>
>
I just checked and indeed, there seems to be no sentence
segmentation/detection by Talismane. I tried to add a French OpenNLP model
for sentence segmentation, but I am not sure if it works:

OpenNlpSentenceDetectionEngine Sentence Detection Model SentenceModel for
lanugage 'fr' version: 1.5.3

OpenNlpSentenceDetectionEngine > add Sentence: [0, 115]

OpenNlpSentenceDetectionEngine > add Sentence: [116, 248]

OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]

OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]

OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]

OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]

OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]

OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]

OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]
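
As a quick sanity check on these spans, here is a small Python sketch (not part of Stanbol) that looks up which sentence a character offset falls into and whether a token begins that sentence. The `La` token discussed in this thread is reported at [1087, 1089]. The helper names are invented for illustration, and treating the end offset as exclusive is an assumption based on the gaps between consecutive spans.

```python
from bisect import bisect_right

# Sentence spans copied from the OpenNlpSentenceDetectionEngine log above.
SENTENCES = [
    (0, 115), (116, 248), (249, 431), (432, 513), (514, 552),
    (554, 780), (781, 971), (972, 1085), (1087, 1264),
]


def sentence_of(offset, sentences=SENTENCES):
    """Return the (start, end) span containing offset, or None.

    Assumes spans are sorted and that `end` is exclusive.
    """
    starts = [start for start, _ in sentences]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and offset < sentences[i][1]:
        return sentences[i]
    return None


def is_sentence_start(token_start, sentences=SENTENCES):
    """True if a token beginning at token_start opens a detected sentence."""
    span = sentence_of(token_start, sentences)
    return span is not None and span[0] == token_start


print(sentence_of(1087))        # (1087, 1264)
print(is_sentence_start(1087))  # True
```

So according to the detected spans, the token at offset 1087 is the first token of its sentence.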


Now, here are the logs of the processing of the token "La":

ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'

ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=true]
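
For reading such lines, a small Python helper can pull out the token span, the POS tag and its probability, and compare the probability with the 0.55 threshold from the engine configuration quoted further down. This is only a sketch that parses the log format shown above. The names `PATTERN`, `MIN_PROB` and `parse_token_line` are made up for illustration, and how Stanbol applies the threshold internally is not shown here.

```python
import re

# Log line from above, with the mail client's line wrap joined again.
LINE = (
    "ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos: "
    "ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'"
)

PATTERN = re.compile(
    r"Token: \[(?P<start>\d+), (?P<end>\d+)\] (?P<text>\S+) "
    r"\(pos:\[Value \[pos: (?P<tag>\w+)\(.*?\)\]\.prob=(?P<prob>[0-9.]+)\]\)"
)

MIN_PROB = 0.55  # threshold taken from the configuration in this thread


def parse_token_line(line):
    """Extract span, text, POS tag and probability; None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "start": int(m["start"]),
        "end": int(m["end"]),
        "text": m["text"],
        "tag": m["tag"],
        "prob": float(m["prob"]),
    }


info = parse_token_line(LINE)
print(info["tag"], info["prob"] >= MIN_PROB)  # ADJ False
```

For this token the only POS suggestion is far below the configured threshold.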

[...]

EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
matchable=true | chunk: none

EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true

EntityLinker >> searchStrings [La, recherche]

EntityLinker - found 1 entities ...

EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)

abelTokenizer for language null

abelTokenizer Language null not configured to be supported

abelTokenizer for language null

abelTokenizer Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
for language null

MainLabelTokenizer - tokenized la -> [la]

EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://www.edf.fr/EdfAcronyme.owl#LA ranking: null

EntityLinker >> Suggestions:

EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://www.edf.fr/EdfAcronyme.owl#LA ranking: null


So the result is the same as before. Does OpenNLP work well together with
Talismane? I saw that the ranking of the sentence detection engine was lower
than the ranking of Talismane and the linking engine (-100 vs 0). I looked at
the rankings because the documentation of the engine says:
"*Language* (required): The language of the text needs to be available. It
is read as specified by STANBOL-613
<https://issues.apache.org/jira/browse/STANBOL-613> from the
metadata of the ContentItem. Effectively this means that any
Stanbol Language Detection engine will need to be executed *before the
OpenNLP POS Tagging Engine*." The POS tagging engine is Talismane in my case.

The logs are exactly the same, but just for the sake of it (or in case I
missed something), I will copy them:

OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]

OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]

OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]

OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]

OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]

OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]

OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]


ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'

ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=true]


EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
matchable=true | chunk: none

EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true

EntityLinker >> searchStrings [La, recherche]

EntityLinker - found 1 entities ...

EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
for language null

MainLabelTokenizer - tokenized la -> [la]

EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://www.edf.fr/EdfAcronyme.owl#LA ranking: null

EntityLinker >> Suggestions:

EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://www.edf.fr/EdfAcronyme.owl#LA ranking: null




> Can you send the text sample you used, so that I can check why
> Talismane fails to correctly split the sentences.
>
> best
> Rupert
>
> > Here are some log excerpts:
> >
> > On the token 'La', which is (I think) a determiner, and in any case
> > definitely not a noun:
> >
> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
> > ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
> >
> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
> > matchable=true | chunk: none
> >
>
> Here it says that La is the 15th token of the Sentence. This is the
> reason why it is marked as linkable.
>
>
OK, I think I understand... but if I get it right, then after lowercasing it
the token should be linked / linkable too. But it is not. Search for
"<look here>" to find the part of the message related to this.


>
> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true,
> > matchable=true
> >
> > EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
> >
> > EntityLinker >> searchStrings [La, recherche]
> >
> > EntityLinker - found 1 entities ...
> >
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> >
> > MainLabelTokenizer - tokenized la -> [la]
> >
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >
> >
> > Then I went to the page of the JIRA issue 1049 and guessed that my token
> > corresponded to the "unknown POS tag rule".
> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
> > this have anything to do with the *Upper Case Token Mode* parameter?
> >
> > Since my tokens 'La' are always at the beginning of a sentence, I guessed
> > they fell into the category:
> > "else - lower case token or sentence or sub-sentence start
> >         * tokens equals or longer as
> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
> >
> > I don't understand that rule: is it supposed to override the *Upper Case
> > Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to
> > 'la', and the tokens 'la' are never processed. Here is the log excerpt:
> >
> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
> > DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
> > chunk: 'none'
> >
>

<look here>


> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
> > matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
> > upperCase=false]
> >
> >
> > After a few minutes of reflection, I see that linkabkePos and
> > matchablePos are no longer equal to "null". What is the rule for setting
> > them to null or not? It is strange that a mere uppercase letter can change
> > the POS tag Talismane assigns to the token that drastically, but I cannot
> > do anything about it. I still have the question about the supposed
> > overriding of the *Upper Case Token Mode* parameter by the "unknown POS
> > tag rule".
> >
> >
> >
> > On a closely related topic, the *Upper Case Token Mode* parameter doesn't
> > seem to behave properly (or I missed something). I left "uc=NONE" in the
> > config of the engine and monitored the processing of the token. Here are
> > the logs for the token "utilisée" in the text: "AE est une mesure
> > couramment utilisée."
> >
> > ProcessingState   > 5: Token: [543, 551] utilisée (pos:[Value [pos:
> > VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk:
> > 'none'
> > ProcessingState     - TokenData:
> > 'utilisée'[linkable=false(linkabkePos=false)|
> > matchable=false(matchablePos=false)| alpha=true| seachLength=true|
> > upperCase=false]
> >
> > The token is not processed, which I am fine with since its POS tag is VPP.
> >
> >
> > Now for the token "Utilisée" in the text: "AE est une mesure couramment
> > Utilisée."
> > ProcessingState   > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*
> > (olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
> > ProcessingState     - TokenData:
> > 'Utilisée'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
> >
> > So the POS tag is OK, but the prob doesn't reach the threshold (which I
> > set to 0.55). Here is the log of the processing of the token:
> >
> > EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true,
> > matchable=true | chunk: none
> > EntityLinker     - 4:'couramment' (lemma: null) linkable=false,
> > matchable=false
> > EntityLinker     - 6:'.' (lemma: null) linkable=false, matchable=false
> > EntityLinker     + 3:'mesure' (lemma: null) linkable=true, matchable=true
> > EntityLinker   >> searchStrings [mesure, Utilisée]
> >
> > Is it a problem with the POS tagging, with the UpperCase linking, or did I
> > misunderstand something?
> >
> > Thank you for the time you spend helping us users, it is very much
> > appreciated.
> > Best regards, Joseph
> >
> > 2013/6/3 Rupert Westenthaler <rupert.westentha...@gmail.com>
> >
> >> Hi Joseph
> >>
> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> >> <jbi...@object-ive.com> wrote:
> >> > I think it is the tokenizing process of Talismane NLP, since my
> >> > enhancement chain is:
> >> > -langdetect
> >> > -talismaneNLP
> >> > -MyVocabulary
> >> >
> >>
> >> I also used Talismane when testing and I was not seeing tokens like that.
> >>
> >> Here is an excerpt of my log (with minSearchTokenLength set to 2):
> >>
> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
> >> | chunk: none
> >>     - 10:'*' (lemma: null) linkable=false, matchable=false
> >>     - 12:'*' (lemma: null) linkable=false, matchable=false
> >>     - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> >>     - 13:'une' (lemma: null) linkable=false, matchable=false
> >>     - 8:')' (lemma: null) linkable=false, matchable=false
> >>     + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >>   >> searchStrings [AE, servitude]
> >>
> >> best
> >> Rupert
> >>
> >>
> >> --
> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
