Thank you for your quick answer. Here is the text I used:

La recherche d'information (RI1) est le domaine qui étudie la manière de retrouver des informations dans un corpus. Celui-ci est composé de documents d'une ou plusieurs bases de données, qui sont décrits par un contenu ou les métadonnées associées. Les bases de données peuvent être relationnelles ou non structurées, telles celles mises en réseau par des liens hypertexte comme dans le World Wide Web, l'internet et les intranets. Le contenu des documents peut être du texte, des sons, ses images ou des données. AE est une mesure couramment utilisée.
La recherche d'information est historiquement liée aux sciences de l'information et à la bibliothéconomie qui visent à représenter des documents dans le but d'en récupérer des informations, au moyen de la construction d’index. L’informatique a permis le développement d’outils pour traiter l’information et établir la représentation des documents au moment de leur indexation, ainsi que pour rechercher l’information. La recherche d'information est aujourd'hui un champ pluridisciplinaire, intéressant même les sciences cognitives. La recherche d'information sur le web à l'aide d'un moteur de recherche est une technique de l'information et de la communication, désormais massivement adoptée par les usagers.

Here is the RDF describing my entity:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.edf.fr/EdfAcronyme.owl#"
    xmlns:j.1="http://xmlns.com/foaf/0.1/"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:j.3="http://purl.org/dc/terms/"
    xmlns:j.2="http://stanbol.apache.org/ontology/entityhub/entityhub#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" >
  <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#LA.meta">
    <j.2:isChached rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean ">true</j.2:isChached>
    <j.1:primaryTopic rdf:resource=" http://www.edf.fr/EdfAcronyme.owl#LA.meta"/>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
    <j.2:about rdf:resource="http://www.edf.fr/EdfAcronyme.owl#LA"/>
    <j.2:site rdf:datatype="http://www.w3.org/2001/XMLSchema#string ">EDFAcronyme</j.2:site>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#LA">
    <rdfs:label>LA</rdfs:label>
    <rdf:type rdf:resource="http://www.edf.fr/EdfAcronyme.owl#Acronyme"/>
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
    <j.1:name>LA</j.1:name>
    <j.3:description>License Application</j.3:description>
    <j.1:isPrimaryTopicOf rdf:resource=" http://www.edf.fr/EdfAcronyme.owl#LA.meta"/>
  </rdf:Description>
</rdf:RDF>

2013/6/3 Rupert Westenthaler <rupert.westentha...@gmail.com>

> Hi Joseph,
>
> you are right, the 'Upper Case Token Mode' interferes with the
> configured UpperCase mode. Maybe it would be better to remove the
> 'Upper Case Token Mode' parameter introduced by STANBOL-1049 and
> implement a similar functionality by using the existing "Upper Case"
> parameter. But I am not yet completely sure if this is possible. In any
> case I will link your previous mail with this issue and note this as an
> unresolved issue for STANBOL-1049.
>
> I think in your specific case it would be best to use a very low
> probability setting (e.g. prop=0.001) as it seems that a lot of the
> suggestions of Talismane are ok, even if they do have a very low
> probability. This would prevent the "unknown POS tag fallback" from
> taking effect and therefore work around the described issues.
>
> In addition you should consider activating case sensitive matching.
> This would also ensure that 'La' in the text is NOT matched with 'LA'
> in the controlled vocabulary.
>
> Let me also add something about Upper Case and sentence start.
>
> On Mon, Jun 3, 2013 at 11:50 AM, Joseph M'Bimbi-Bene
> <jbi...@object-ive.com> wrote:
> > Here is the configuration of my linking engine:
> >
> > *;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
> >
> > Since I didn't want determiners to be linkable when they are
> > uppercased at the beginning of a sentence, I explicitly specified
> > that uppercase tokens should not be treated specially.
>
> Upper Case Tokens at the beginning of sentences or sub-sentences (e.g.
> at the beginning of a quote) are ignored. So a 'La' at the beginning of a
> sentence MUST NOT be considered as an upper case token. So if you see
> 'La' being linked at a sentence start, then this would indicate that
> the sentence detection does not work properly.

I just checked and indeed, there seems to be no sentence segmentation/detection by Talismane.
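
To make the effect of the `prop` setting concrete, here is a minimal Python sketch (it is not Stanbol code). It splits the configuration line quoted above into its parts and compares POS tag probabilities taken from the log excerpts in this thread against `prop`. The function names `parse_linking_config` and `pos_tag_is_trusted` are made up for illustration, and the rule "a tag below `prop` is treated as unknown" is an assumption based on how the "unknown POS tag fallback" is described in this thread, not the engine's actual implementation.

```python
def parse_linking_config(line):
    """Split a 'lang;flag;key=value;...' line into language, flags and params."""
    language, *parts = line.split(";")
    flags, params = [], {}
    for part in parts:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key] = value
        else:
            flags.append(part)
    return language, flags, params


def pos_tag_is_trusted(tag_probability, min_probability):
    """Assumed rule: a POS tag below the threshold is treated as unknown."""
    return tag_probability >= min_probability


language, flags, params = parse_linking_config(
    "*;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75")
threshold = float(params["prop"])

# Probabilities copied from the log excerpts quoted in this thread.
LA_UPPER_ADJ = 0.016871281997002517   # 'La' tagged ADJ
LA_LOWER_DET = 0.9445673708042409     # 'la' tagged DET
UTILISEE_NPP = 0.19181597467804898    # 'Utilisée' tagged NPP

for name, prob in [("La/ADJ", LA_UPPER_ADJ),
                   ("la/DET", LA_LOWER_DET),
                   ("Utilisée/NPP", UTILISEE_NPP)]:
    # Compare each tag against the configured and the suggested threshold.
    print(name,
          "trusted at prop=0.55:", pos_tag_is_trusted(prob, threshold),
          "| trusted at prop=0.001:", pos_tag_is_trusted(prob, 0.001))
```

Under this assumption, the ADJ tag on 'La' and the NPP tag on 'Utilisée' are both discarded at prop=0.55, which is what would let the fallback apply, while prop=0.001 would keep all three tags.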
I tried to add a French OpenNLP model for sentence segmentation, but I am not sure if it works:

OpenNlpSentenceDetectionEngine Sentence Detection Model SentenceModel for lanugage 'fr' version: 1.5.3
OpenNlpSentenceDetectionEngine > add Sentence: [0, 115]
OpenNlpSentenceDetectionEngine > add Sentence: [116, 248]
OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]
OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]
OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]
OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]
OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]
OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]
OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]

Now, the logs of the processing of the token "La":

ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos: ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
[...]
EntityLinker --- preocess Token 0: La (lemma: null) linkable=true, matchable=true | chunk: none
EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
EntityLinker >> searchStrings [La, recherche]
EntityLinker - found 1 entities ...
EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
abelTokenizer for language null
abelTokenizer Language null not configured to be supported
abelTokenizer for language null
abelTokenizer Language null not configured to be supported
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
MainLabelTokenizer - tokenized la -> [la]
EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
EntityLinker >> Suggestions:
EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null

So, same as before. Does OpenNLP work well alongside Talismane? I saw that the ranking of the sentence detection engine was lower than the ranking of Talismane and the linking engine (-100 vs 0), since the documentation of the engine says: "*Language* (required): The language of the text needs to be available. It is read as specified by STANBOL-613 <https://issues.apache.org/jira/browse/STANBOL-613> from the metadata of the ContentItem. Effectively this means that any Stanbol Language Detection engine will need to be executed *before the OpenNLP POS Tagging Engine.*" The POS tagging engine is Talismane in my case.
The logs are exactly the same, but just for the sake of it (or in case I missed something), I will copy them:

OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]
OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]
OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]
OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]
OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]
OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]
OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]
ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos: ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
EntityLinker --- preocess Token 0: La (lemma: null) linkable=true, matchable=true | chunk: none
EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
EntityLinker >> searchStrings [La, recherche]
EntityLinker - found 1 entities ...
EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
03.06.2013 15:41:53.188 *TRACE* [Thread-5674] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
03.06.2013 15:41:53.188 *TRACE* [Thread-5674] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
MainLabelTokenizer - tokenized la -> [la]
EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
EntityLinker >> Suggestions:
EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null

> Can you send the text sample you used, so that I can check why
> Talismane fails to correctly split the sentences.
>
> best
> Rupert
>
> > Here are some log excerpts:
> >
> > On token 'La', which is (I think) a determiner, anyway, definitely not a
> > Noun:
> >
> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
> >
> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true, matchable=true | chunk: none
>
> Here it says that La is the 15th token of the Sentence. This is the
> reason why it is marked as linkable.
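
The sentence-start rule described here can be illustrated with a small Python sketch. It is only an illustration of the rule as stated in this thread, not the engine's code: the helper names (`sentence_of`, `is_sentence_start`, `counts_as_upper_case`) are hypothetical, the sentence spans are the ones reported by the OpenNlpSentenceDetectionEngine log, and the "merged" span list is an assumption chosen to be consistent with 'La' being reported as token 15 of its sentence.

```python
from bisect import bisect_right

# Sentence spans as logged by the OpenNlpSentenceDetectionEngine.
SENTENCES = [(0, 115), (116, 248), (249, 431), (432, 513), (514, 552),
             (554, 780), (781, 971), (972, 1085), (1087, 1264)]

# Assumption: without a working sentence split, the last two sentences
# are treated as one, so 'La' sits in the middle of a sentence.
MERGED = SENTENCES[:-2] + [(972, 1264)]


def sentence_of(token_start, sentences):
    """Return the (start, end) span containing token_start, or None."""
    starts = [start for start, _ in sentences]
    index = bisect_right(starts, token_start) - 1
    if index >= 0 and token_start < sentences[index][1]:
        return sentences[index]
    return None


def is_sentence_start(token_span, sentences):
    """True if the token begins exactly where its sentence begins."""
    sentence = sentence_of(token_span[0], sentences)
    return sentence is not None and sentence[0] == token_span[0]


def counts_as_upper_case(text, token_span, sentences):
    """Rule as described: upper case at a sentence start is ignored."""
    return text[:1].isupper() and not is_sentence_start(token_span, sentences)


la_span = (1087, 1089)  # token 'La' from the logs
print(counts_as_upper_case("La", la_span, SENTENCES))
print(counts_as_upper_case("La", la_span, MERGED))
```

With the detected sentence spans, 'La' starts a sentence and would not count as an upper case token; with the merged span it would. The logs in this thread still show upperCase=true and linkable=true after sentence detection was added, so the engine evidently takes more into account than this sketch does.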

ok, I think I understand ... but if I get it right, then after lowercasing it, the token should be linked / linkable too. But it is not. Search for "<look here>" to find the part of the message related to it.

> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true
> > EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [La, recherche]
> > EntityLinker - found 1 entities ...
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> > MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
> > MainLabelTokenizer - tokenized la -> [la]
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >
> > Then I went to the page of the JIRA issue 1049 and I guessed my token
> > corresponded to the "unknown POS tag rule".
> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
> > this have anything to do with the *Upper Case Token Mode* parameter?
> >
> > Since my tokens 'La' are always at the beginning of the sentence, I guessed
> > they fell into the category:
> > "else - lower case token or sentence or sub-sentence start
> > * tokens equals or longer as
> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
> >
> > I don't understand that rule: is that supposed to override the *Upper Case
> > Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to 'la',
> > and the tokens 'la' are never processed. Here is the log excerpt:
> >
> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos: DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409]) chunk: 'none'

<look here>

> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)| matchable=false(*matchablePos=false*)| alpha=true| seachLength=true| upperCase=false]
> >
> > After a few minutes of reflection, I see that linkabkePos and matchablePos are
> > no longer equal to "null". What is the rule for setting them to null or not?
> > It is strange that just an uppercase letter can change the POS tag of the token
> > that drastically for Talismane, but I cannot do anything about it. I still have
> > the question about the supposed overriding of the *Upper Case Token
> > Mode* parameter for the "unknown POS tag rule".
> >
> > On a quite related topic, the *Upper Case Token Mode* parameter doesn't
> > seem to behave properly (or I missed something). I left "uc=NONE" in the
> > config of the engine and monitored the processing of the token; here are
> > the logs. On the token "utilisée" for the text: "AE est une mesure
> > couramment utilisée."
> >
> > ProcessingState > 5: Token: [543, 551] utilisée (pos:[Value [pos: VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
> > ProcessingState - TokenData: 'utilisée'[linkable=false(linkabkePos=false)| matchable=false(matchablePos=false)| alpha=true| seachLength=true| upperCase=false]
> >
> > The token is not processed, which I am fine with since its POS tag is VPP.
> >
> > Now on the token "Utilisée" for the text: "AE est une mesure couramment
> > Utilisée."
> >
> > ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*(olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
> > ProcessingState - TokenData: 'Utilisée'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
> >
> > So the POS tag is OK, but the prob doesn't reach the threshold (which I set
> > to 0.55). Here is the log of the processing of the token:
> >
> > EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true, matchable=true | chunk: none
> > EntityLinker - 4:'couramment' (lemma: null) linkable=false, matchable=false
> > EntityLinker - 6:'.' (lemma: null) linkable=false, matchable=false
> > EntityLinker + 3:'mesure' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [mesure, Utilisée]
> >
> > Is it a problem with the POS tagging, with the UpperCase linking, or did I
> > misunderstand something?
> >
> > Thank you for the time you spend helping us users, it is very much appreciated.
> >
> > best regards, Joseph
> >
> > 2013/6/3 Rupert Westenthaler <rupert.westentha...@gmail.com>
> >
> >> Hi Joseph
> >>
> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> >> <jbi...@object-ive.com> wrote:
> >> > I think it is the tokenizing process of Talismane NLP, since my
> >> > enhancement chain is:
> >> > -langdetect
> >> > -talismaneNLP
> >> > -MyVocabulary
> >>
> >> I also used Talismane when testing and I was not seeing tokens like that.
> >>
> >> Here is an excerpt of my log (with minSearchTokenLength set to 2):
> >>
> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true | chunk: none
> >> - 10:'*' (lemma: null) linkable=false, matchable=false
> >> - 12:'*' (lemma: null) linkable=false, matchable=false
> >> - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> >> - 13:'une' (lemma: null) linkable=false, matchable=false
> >> - 8:')' (lemma: null) linkable=false, matchable=false
> >> + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >> >> searchStrings [AE, servitude]
> >>
> >> best
> >> Rupert
> >>
> >> --
> >> | Rupert Westenthaler rupert.westentha...@gmail.com
> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> | A-5500 Bischofshofen