Re: stanbol bug report

Rupert Westenthaler Tue, 14 May 2013 04:59:48 -0700

Hi Joseph,


On Tue, May 14, 2013 at 12:44 PM, Joseph M'Bimbi-Bene
<[email protected]> wrote:
>
>> (3) Talismane Integration :
>> ===================
>>
>> I added an Entity
>>
>> <rdf:Description rdf:about="http://example.org/resource/Mario">
>>     <skos:prefLabel>Mario</skos:prefLabel>
>>     <skos:altLabel>le plombier moustachu</skos:altLabel>
>>     <rdfs:label>Mario</rdfs:label>
>>     <rdfs:label>le plombier moustachu</rdfs:label>
>>     <rdf:type>http://example.org/concept#gentil</rdf:type>
>>     <rdf:type>http://example.org/concept#humain</rdf:type>
>>   </rdf:Description>
>>
>> configured an EnhancementChain with
>>
>> * langdetect
>> * talismane-nlp
>> * EntityLinkingEngine for the site with the Entity and DEACTIVATED
>> proper noun linking
>>
>> and sent the text
>>
>>     Mario Kart 7, le plombier moustachu est toujours un pilote d'élite
>>
>>
> it works well with this very text, but for example, with the text "Mario
> Kart 7, le plombier conducteur moustachu est toujours un pilote d'élite",
> only Mario gets recognized

"le plombier conducteur moustachu" does not match "le plombier
moustachu" because "conducteur" is a matchable token, and labels of
entities MUST contain all matchable tokens within the text. If
"conducteur" were neither a matchable nor a linkable token, your
assumption (that you get a match, but with a lower score) would be
correct.

The reason for this rule is to avoid false positives.

> Here is an extract from the logs:
[..]
> EntityLinker - 4:'le' (lemma: null) linkable=false, matchable=false

If 'le' were missing from the label it would still match, as 'le' is
not a matchable token.

>
> EntityLinker + 6:'conducteur' (lemma: null) linkable=true, matchable=true

'conducteur' is matchable=true, therefore labels missing this token
will not match.

> EntityLinker >> searchStrings *[plombier, conducteur]*

and

>
> I guess i misunderstood the process. What is the role of "searchstring" and
> the tokens inside precisely ? The documentation says the query
> "{lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]" and "Tokens
> in the Label are matched with Tokens in the text until the first matchable
> or 2nd non-matchable token is not found". Here is the logs describing the
> tokens

searchStrings are the arguments used for queries against the
vocabulary. Label matching is then performed on the results of those
queries. This means that it is possible (and in fact not unlikely)
that the queries return results, but label matching does not accept a
single one of them.
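
To make the two stages concrete, here is a minimal sketch in Python.
It is not Stanbol code: the toy vocabulary, the function names and
the "any search string occurs in a label" query semantics are
assumptions made for illustration, and the second stage is reduced to
the rule that a label must contain every matchable token.

```python
from typing import Dict, List, Set

# Toy vocabulary holding only the entity from this thread (illustrative).
VOCABULARY: Dict[str, List[str]] = {
    "http://example.org/resource/Mario": ["Mario", "le plombier moustachu"],
}


def words(label: str) -> Set[str]:
    """Lower-cased, whitespace-separated words of a label."""
    return set(label.lower().split())


def query_candidates(search_strings, vocabulary):
    """Stage 1: entities with a label containing at least one search string."""
    wanted = {s.lower() for s in search_strings}
    return {
        uri: labels
        for uri, labels in vocabulary.items()
        if any(wanted & words(label) for label in labels)
    }


def accept_labels(candidates, matchable_tokens):
    """Stage 2 (simplified): keep labels containing every matchable token."""
    required = {t.lower() for t in matchable_tokens}
    accepted = {}
    for uri, labels in candidates.items():
        kept = [label for label in labels if required <= words(label)]
        if kept:
            accepted[uri] = kept
    return accepted


# The query finds Mario via "plombier" ...
candidates = query_candidates(["plombier", "conducteur"], VOCABULARY)
# ... but no label contains "conducteur", so nothing is accepted.
accepted = accept_labels(candidates, ["plombier", "conducteur", "moustachu"])
```

In this sketch `candidates` is non-empty while `accepted` is empty,
which is the situation described above for the text containing
"conducteur".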

Label matching works as follows:

* Tokens in the text are compared with tokens in the label until
   * a matchable token is not present in the label, or
   * the second 'other' token is not present in the label
     (non-alphanumeric tokens are not counted)
* Tokens in the text earlier than the current token are compared with
  not yet matched tokens in the label until
   * the index of the last already matched token is reached,
   * a matchable token is not present in the label, or
   * the second 'other' token is not present in the label
     (non-alphanumeric tokens are not counted)
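
The forward and backward scan can be sketched roughly like this in
Python. This is an illustration under assumptions, not the engine's
implementation: `Token`, `match_label` and `last_matched` are
hypothetical names, tokens are compared by exact lower-cased equality,
the 'other' counter is kept separately per direction, and the
matchable flags in the example are guessed, apart from 'le' and
'conducteur', which are taken from the log above.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    matchable: bool


def _scan(tokens, indices, unmatched):
    """Walk text tokens in the given order; return indices found in the label."""
    matched = []
    missed_other = 0
    for i in indices:
        word = tokens[i].text.lower()
        if not any(ch.isalnum() for ch in word):
            continue  # punctuation and similar tokens are not counted
        if word in unmatched:
            unmatched.remove(word)
            matched.append(i)
        elif tokens[i].matchable:
            break  # a matchable token is missing from the label
        else:
            missed_other += 1
            if missed_other == 2:
                break  # second 'other' token missing from the label
    return matched


def match_label(tokens: List[Token], current: int, label: str,
                last_matched: int = -1) -> Tuple[List[int], List[str]]:
    """Return (matched text indices, label words left unmatched)."""
    unmatched = [w.lower() for w in label.split()]
    forward = _scan(tokens, range(current, len(tokens)), unmatched)
    # Go backwards, but never past the last token that was already matched.
    backward = _scan(tokens, range(current - 1, last_matched, -1), unmatched)
    return sorted(backward + forward), unmatched


tokens = [
    Token("Mario", True), Token("Kart", True), Token("7", False),
    Token(",", False), Token("le", False), Token("plombier", True),
    Token("conducteur", True), Token("moustachu", True),
]
matched, left_over = match_label(tokens, 5, "le plombier moustachu")
```

Starting at 'plombier' (index 5), the forward scan stops at
'conducteur', so the label word 'moustachu' is never reached and stays
in `left_over`.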

Tokens in the text do not have to match tokens in the label exactly.
By default only the first 75% of the characters need to match (the
token match factor). If the tokens in the label are not in the same
order as in the text, the confidence of the match is reduced.
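
As a rough illustration of the token match factor, the Python sketch
below compares two tokens by their common prefix. The function name
and, in particular, the choice to take 75% of the longer token
(rounded up) are assumptions; the engine may compute the required
length differently.

```python
import math


def tokens_match(text_token: str, label_token: str,
                 match_factor: float = 0.75) -> bool:
    """True if the tokens share a long enough common prefix."""
    a, b = text_token.lower(), label_token.lower()
    if a == b:
        return True
    # Assumption: the required prefix is measured against the longer token.
    required = math.ceil(max(len(a), len(b)) * match_factor)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common >= required


plural_matches = tokens_match("plombiers", "plombier")
short_matches = tokens_match("plombier", "plomb")
```

Under these assumptions "plombiers" still matches "plombier" (8 of the
required 7 leading characters agree), while "plomb" does not match
"plombier" (5 of the required 6).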

I hope this answers your questions.

best
Rupert


--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
