Re: problem with entity recognition or linking in french

Rupert Westenthaler Sat, 20 Apr 2013 00:14:58 -0700

On Fri, Apr 19, 2013 at 5:49 PM, Joseph M'Bimbi-Bene
<[email protected]> wrote:
> Hello Rupert,
>
> Since I am on it, why is "le" even considered for the matching? I thought
> labels were tokenized and tokens with length < 3 were not even
> considered for the matching with the default config, or am I mixing up
> different concepts?


Only tokens in the text are processed as described. No such processing is
done for labels, so "le" counts as a regular token of the label.
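
To make the arithmetic concrete, here is a rough sketch in Python (not
Stanbol code) of how the label score from the mail quoted below comes out
for your example. It only counts exact token matches and ignores the
penalties for partial matches and token order; the function name
label_score and the whitespace tokenization are assumptions made purely for
illustration.

```python
# Illustrative only: this is NOT the EntityLinkingEngine implementation.
# It only reproduces the "matched label tokens / label tokens" ratio.

MIN_LABEL_SCORE = 0.75  # default of enhancer.engines.linking.minLabelScore


def label_score(label: str, text: str) -> float:
    """Fraction of label tokens that also occur (exactly) in the text."""
    label_tokens = label.lower().split()  # label tokens are NOT filtered by length
    text_tokens = set(text.lower().split())
    if not label_tokens:
        return 0.0
    matched = sum(1 for token in label_tokens if token in text_tokens)
    return matched / len(label_tokens)


score = label_score("le plombier moustachu", "cette plombier moustachu")
print(round(score, 3), score >= MIN_LABEL_SCORE)  # 0.667 False
```

Because "le" is counted as a label token, only 2 of 3 label tokens match,
which is below the default of 0.75 but above a lowered value such as 0.55.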

> Do I have to code my own labelTokenizer? Since we intend to sell a product
> to a client who has no idea how that thing works and will basically enter
> labels in an Excel file or something of that sort, I would like to have
> that behaviour.
>

Never tried it, but this should be possible.
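
If a custom tokenizer turns out to be too much work, one alternative (also
untested, and a different technique than a labelTokenizer) would be to
clean the labels before they are imported into the vocabulary. The Python
sketch below drops label tokens shorter than three characters, so labels
get roughly the same treatment as text tokens; clean_label and
MIN_TOKEN_LENGTH are hypothetical names and not part of Stanbol.

```python
MIN_TOKEN_LENGTH = 3  # assumption: same threshold as used for text tokens


def clean_label(label: str, min_length: int = MIN_TOKEN_LENGTH) -> str:
    """Drop tokens shorter than min_length; keep the label if nothing is left."""
    tokens = label.split()
    kept = [token for token in tokens if len(token) >= min_length]
    # Fall back to the original label so very short labels do not become empty.
    return " ".join(kept) if kept else label.strip()


print(clean_label("le plombier moustachu"))  # plombier moustachu
```

Note that this changes the stored labels, so it is only suitable if short
words such as articles never carry meaning in your vocabulary.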

best
Rupert

>
> 2013/4/19 Joseph M'Bimbi-Bene <[email protected]>
>
>> i forgot a screenshot in the document.
>>
>>
>> 2013/4/19 Joseph M'Bimbi-Bene <[email protected]>
>>
>>> I saw those lines in the documentation and actually tried to insert them
>>> directly in the .config file of the engine in
>>> {stanbol-install-dir}/stanbol/fileinstall.
>>> Then I saw your answer and tried it, but it doesn't work.
>>> I prepared a pdf doc with screenshots describing what i did and the
>>> results, i think it will be better than
>>>
>>>
>>> 2013/4/19 Rupert Westenthaler <[email protected]>
>>>
>>>> Hi Joseph:
>>>>
>>>> The reason for your results is the "Min Label Score"
>>>> (enhancer.engines.linking.minLabelScore) parameter of the
>>>> EntityLinkingEngine.
>>>>
>>>> Copied from [1]
>>>>
>>>>  * Min Label Score (enhancer.engines.linking.minLabelScore)
>>>> [0..1]::double: The "Label Score" [0..1] represents how much of the
>>>> Label of an Entity matches with the Text. It compares the number of
>>>> Tokens of the Label with the number of Tokens matched to the Text.
>>>> Non-exact matches for Tokens, or Tokens within the label that appear
>>>> in a different order than in the text, also reduce this score. Entities
>>>> are only considered if at least one of their labels scores higher than
>>>> the minimum for all three of Min Label Score, Min Text Match Score and
>>>> Min Match Score.
>>>>
>>>> The default value of this parameter is "0.75".
>>>>
>>>> In your case where "cette plombier moustachu" is matched against "le
>>>> plombier moustachu" the actual label match score is only "0.667" (2/3
>>>> tokens of the label do match the text). Because of that the Entity is
>>>> not linked in that case.
>>>>
>>>> If you would like to link Entities where two out of three tokens match
>>>> with the text you should lower minLabelScore to a value below "0.66",
>>>> e.g.
>>>>
>>>>     enhancer.engines.linking.minLabelScore="0.55"
>>>>
>>>> NOTE: As this property is not included in the configuration dialog of
>>>> the config tab of the Felix Webconsole you will need to set it directly
>>>> via the config file of the engine instance. See [2] for how to manage
>>>> your configuration within the 'stanbol/fileinstall' folder.
>>>>
>>>> To create a configuration file for the EntityhubLinkingEngine you can
>>>> follow these steps:
>>>>
>>>> 1. To get a config file to start with, look in
>>>> 'stanbol/config/org/apache/stanbol/enhancer/engines/entityhublinking/EntityhubLinkingEngine'
>>>> and take the '{uid}.config' file of the engine you are currently
>>>> using.
>>>>
>>>> 2. Next you will need to name the file like
>>>> "org.apache.stanbol.enhancer.engines.entityhublinking.EntityhubLinkingEngine-{configname}"
>>>> where {configname} should be a human readable name for your
>>>> configuration.
>>>>
>>>> 3. Now you can edit the file using a text editor:
>>>>
>>>>     * remove the "service.bundleLocation", "service.factoryPid" and
>>>> "service.pid" keys. Those are set by the OSGi environment and should
>>>> not be in the config
>>>>     * add the configuration of the minLabelScore property
>>>> 'enhancer.engines.linking.minLabelScore="0.55"'
>>>>     * you can change/add other configuration parameters as described in
>>>> [1]
>>>>
>>>> 4. Finally you need to (1) delete the current configuration of your
>>>> engine via the "config" tab of the Felix Webconsole and (2) copy your
>>>> configuration file to the 'stanbol/fileinstall' folder.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> [1]
>>>> http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entity-linker-configuration
>>>> [2]
>>>> http://stanbol.staging.apache.org/docs/trunk/production-mode/partial-updates.html
>>>>
>>>> On Thu, Apr 18, 2013 at 5:22 PM, Rupert Westenthaler
>>>> <[email protected]> wrote:
>>>> > On Thu, Apr 18, 2013 at 4:04 PM, Joseph M'Bimbi-Bene
>>>> > <[email protected]> wrote:
>>>> >> Thank you for your answer.
>>>> >>
>>>> >> But i misunderstood your indication. I mean, i thought i could specify a
>>>> >> specific word to be linkable or matchable.
>>>> >>
>>>> >> I have another question : how can i see the score when there is no match ?
>>>> >>
>>>> >
>>>> > If there is no match then there is no score.
>>>> >
>>>> > [..log..]
>>>> >> ?
>>>> >
>>>> > OK I can see your point. This is indeed strange behavior. To be
>>>> > honest I have not tested much in settings without POS tags. So this
>>>> > might well be a bug.
>>>> >
>>>> > I will try to reproduce this to have a detailed look what is going on.
>>>> >
>>>> > best
>>>> > Rupert
>>>> >
>>>> >>
>>>> >> I tried nlp2rdf, and in the resulting rdf, i cannot see it (maybe i missed
>>>> >> it though, there is so much information displayed, i am kinda lost)
>>>> >>
>>>> >>
>>>> >> 2013/4/18 Rupert Westenthaler <[email protected]>
>>>> >>
>>>> >>> On Thu, Apr 18, 2013 at 3:16 PM, Joseph M'Bimbi-Bene
>>>> >>> <[email protected]> wrote:
>>>> >>> > I don't see the option, can you give me the procedure or a more precise
>>>> >>> > indication please ?
>>>> >>> >
>>>> >>>
>>>> >>> If you do not want to use POS tagging, then the options are limited:
>>>> >>>
>>>> >>> * uc {NONE/MATCH/LINK}::string - the Upper Case Token Mode allows to
>>>> >>> configure how upper case words are treated. There are three possible
>>>> >>> modes: (1) NONE: defines that they are not specially treated; (2)
>>>> >>> MATCH: defines that they are considered as matchable tokens
>>>> >>> (independent of the POS tag or the token length); (3) LINK: defines
>>>> >>> that they are in any case linked with the vocabulary. The default is
>>>> >>> "LINK" - as upper case words often represent named entities - with the
>>>> >>> exception of German ('de') where the mode is set to MATCH - as all
>>>> >>> Nouns in German are upper case.
>>>> >>>
>>>> >>> e.g.
>>>> >>>
>>>> >>> org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["fr;uc\=MATCH"]
>>>> >>> enhancer.engines.linking.minSearchTokenLength=3
>>>> >>>
>>>> >>> This would MATCH all upper case words and all words with three or
>>>> >>> more chars.
>>>> >>>
>>>> >>> However, if your vocabulary does contain Entities that would appear in
>>>> >>> texts as specific POS (e.g. Nouns) I would really recommend you to
>>>> >>> give POS tagging a try.
>>>> >>>
>>>> >>> If you like you can try to process some of your texts using the
>>>> >>>
>>>> >>> * DBpedia proper noun linking on
>>>> >>> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-proper-noun
>>>> >>> * Freebase proper noun linking currently running in an early test
>>>> >>> version on
>>>> >>> http://dev.iks-project.eu:8083/enhancer/chain/freebase-proper-noun
>>>> >>>
>>>> >>> both chains do use the talismane integration [1] for NLP processing
>>>> >>>
>>>> >>> best
>>>> >>> Rupert
>>>> >>>
>>>> >>> > best
>>>> >>> > Rupert
>>>> >>> >
>>>> >>> >
>>>> >>> > [1] https://github.com/westei/stanbol-talismane
>>>> >>> > [2] http://dev.iks-project.eu:8081/enhancer/chain/NIF-demo
>>>> >>> > [3] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#linking-process
>>>> >>> >
>>>> >>> > --
>>>> >>> > | Rupert Westenthaler             [email protected]
>>>> >>> > | Bodenlehenstraße 11                             ++43-699-11108907
>>>> >>> > | A-5500 Bischofshofen
>>>> >>>
>>>> >>>
>>>> >
>>>>
>>>>
>>>
>>>
>>



--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
