Hi Joseph,

On Tue, May 14, 2013 at 12:44 PM, Joseph M'Bimbi-Bene <[email protected]> wrote:
>
>> (3) Talismane Integration :
>> ===================
>>
>> I added an Entity
>>
>> <rdf:Description rdf:about="http://example.org/resource/Mario">
>>     <skos:prefLabel>Mario</skos:prefLabel>
>>     <skos:altLabel>le plombier moustachu</skos:altLabel>
>>     <rdfs:label>Mario</rdfs:label>
>>     <rdfs:label>le plombier moustachu</rdfs:label>
>>     <rdf:type>http://example.org/concept#gentil</rdf:type>
>>     <rdf:type>http://example.org/concept#humain</rdf:type>
>> </rdf:Description>
>>
>> configured an EnhancementChain with
>>
>> * langdetect
>> * talismane-nlp
>> * EntityLinkingEngine for the site with the Entity and DEACTIVATED
>>   proper noun linking
>>
>> and sent the text
>>
>> Mario Kart 7, le plombier moustachu est toujours un pilote d'élite
>>
>>
> it works well with this very text, but for example, with the text "Mario
> Kart 7, le plombier conducteur moustachu est toujours un pilote d'élite",
> only Mario gets recognized

"le plombier conducteur moustachu" does not match "le plombier
moustachu" because "conducteur" is a matchable token, and the label of
an entity MUST contain all matchable tokens within the text. If
"conducteur" were neither a matchable nor a linkable token, your
assumption (that you get a match, but with a lower score) would be
correct. The reason for this rule is to avoid false positives.

> Here is an extract from the logs:
[..]
> EntityLinker - 4:'le' (lemma: null) linkable=false, matchable=false

If 'le' were missing from the label, the label would still match,
because 'le' is not a matchable token.

>
> EntityLinker + 6:'conducteur' (lemma: null) linkable=true, matchable=true

'conducteur' is matchable=true, so labels that lack this token will not
match.

> EntityLinker >> searchStrings *[plombier, conducteur]* and
>
> I guess i misunderstood the process. What is the role of "searchstring" and
> the tokens inside precisely ?
> The documentation says the query
> "{lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]" and "Tokens
> in the Label are matched with Tokens in the text until the first matchable
> or 2nd non-matchable token is not found". Here is the logs describing the
> tokens

The searchStrings are the arguments used for queries against the
vocabulary. Label matching is then performed on the results of those
queries. This means it is possible (and in fact not unlikely) that a
query returns results, but label matching does not accept any of them.

Label matching works as follows:

* Tokens in the text, starting at the current token, are compared with
  tokens in the label until
    * a matchable token is not present in the label, or
    * the second 'other' token is not present in the label
      (non-alphanumeric tokens are not counted)
* Tokens in the text earlier than the current token are compared with
  the not yet matched tokens in the label until
    * the index of the last already matched token is reached, or
    * a matchable token is not present in the label, or
    * the second 'other' token is not present in the label
      (non-alphanumeric tokens are not counted)

Tokens in the text do not need to match tokens in the label exactly. By
default only the first 75% of the characters need to match (the token
match factor). If the tokens in the label are not in the same order as
in the text, the confidence of the match is reduced.

I hope this answers your questions.

best
Rupert

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
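
P.S. In case it helps, here is a rough Python sketch of the label
matching rules described above. It is not the EntityLinker code. The
function names (tokens_match, scan, match_label) are made up for this
example, reading the token match factor as "shared prefix length divided
by the longer token" is an assumption, and so are the matchable flags
for every token that does not appear in your log extract. The backward
scan is also simplified: it runs to the start of the text instead of
stopping at the last already matched token.

```python
def tokens_match(text_token, label_token, factor=0.75):
    """Assumed reading of the token match factor: prefix ratio >= factor."""
    a, b = text_token.lower(), label_token.lower()
    longest = max(len(a), len(b))
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return longest > 0 and common / longest >= factor


def scan(text_tokens, indexes, label_tokens, matched):
    """Walk the text tokens at `indexes`; record matched label positions."""
    missed_other = 0
    for i in indexes:
        word, matchable = text_tokens[i]
        if not any(c.isalnum() for c in word):
            continue  # non-alphanumeric tokens are not counted
        hit = next((j for j, lt in enumerate(label_tokens)
                    if j not in matched and tokens_match(word, lt)), None)
        if hit is not None:
            matched.add(hit)
        elif matchable:
            return  # a matchable token is not present in the label
        else:
            missed_other += 1
            if missed_other == 2:
                return  # second 'other' token not present in the label


def match_label(text_tokens, start, label):
    """Return the label tokens matched around text position `start`."""
    label_tokens = label.split()
    matched = set()
    # forward from the current token, then backward over earlier tokens
    scan(text_tokens, range(start, len(text_tokens)), label_tokens, matched)
    scan(text_tokens, range(start - 1, -1, -1), label_tokens, matched)
    return [label_tokens[j] for j in sorted(matched)]


# (token, matchable) pairs; only 'le' and 'conducteur' are taken from the log
text = [("Mario", True), ("Kart", True), ("7", False), (",", False),
        ("le", False), ("plombier", True), ("conducteur", True),
        ("moustachu", True), ("est", False), ("toujours", False)]

print(match_label(text, 5, "le plombier moustachu"))  # ['le', 'plombier']
```

Starting at 'plombier' (index 5), the forward scan stops at 'conducteur'
because it is matchable and absent from the label, so 'moustachu' is
never reached and the label is only partly covered. Removing the
'conducteur' entry from the list lets all three label tokens match.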
