Hello Rupert,

While I am at it: why is "le" even considered for the matching? I thought labels were tokenized and that, with the default config, tokens with length < 3 were not even considered for the matching. Or am I mixing up different concepts? Do I have to code my own labelTokenizer?

We intend to sell a product to a client who has no idea how this works and will basically enter labels in an Excel file or something of that sort, so I would like to have that behaviour.

2013/4/19 Joseph M'Bimbi-Bene <[email protected]>

> I forgot a screenshot in the document.
>
>
> 2013/4/19 Joseph M'Bimbi-Bene <[email protected]>
>
>> I saw those lines in the documentation and actually tried to insert the lines
>> directly in the .config file of the engine in
>> {stanbol-install-dir}/stanbol/fileinstall.
>> Then I saw your answer and tried it, but it doesn't work.
>> I prepared a PDF doc with screenshots describing what I did and the
>> results, I think it will be better than
>>
>>
>> 2013/4/19 Rupert Westenthaler <[email protected]>
>>
>>> Hi Joseph:
>>>
>>> The reason for your results is the "Min Label Score"
>>> (enhancer.engines.linking.minLabelScore) parameter of the
>>> EntityLinkingEngine.
>>>
>>> Copied from [1]
>>>
>>> * Min Label Score (enhancer.engines.linking.minLabelScore)
>>> [0..1]::double: The "Label Score" [0..1] represents how much of the
>>> Label of an Entity matches with the Text. It compares the number of
>>> Tokens of the Label with the number of Tokens matched to the Text. Not
>>> exact matches for Tokens, or if the Tokens within the label do appear
>>> in another order than in the text, do also reduce this score. Entities
>>> are only considered if at least one of their labels scores higher than
>>> the minimum for all three of Min Label Score, Min Text Match Score and
>>> Min Match Score.
>>>
>>> The default value of this parameter is "0.75".
>>>
>>> In your case where "cette plombier moustachu" is matched against "le
>>> plombier moustachu" the actual label match score is only "0.667" (2/3
>>> tokens of the label do match the text). Because of that the Entity is
>>> not linked in that case.
>>>
>>> If you would like to link Entities where two out of three tokens match
>>> with the text you should lower the configuration of minLabelScore to
>>> values < "0.66" e.g.
>>>
>>> enhancer.engines.linking.minLabelScore="0.55"
>>>
>>> NOTE: As this property is not included in the configuration dialog of
>>> the config tab of the Felix Webconsole, you will need to set it directly
>>> via the config file of the engine instance. See [2] for how to manage your
>>> configuration within the 'stanbol/fileinstall' folder.
>>>
>>> To create a configuration file for the EntityhubLinkingEngine you can
>>> follow these steps:
>>>
>>> 1. To get a config file to start with, look at
>>> 'stanbol/config/org/apache/stanbol/enhancer/engines/entityhublinking/EntityhubLinkingEngine'
>>> and take the '{uid}.config' file of the engine you are currently
>>> using.
>>>
>>> 2. Next you will need to name the file like
>>> "org.apache.stanbol.enhancer.engines.entityhublinking.EntityhubLinkingEngine-{configname}"
>>> where {configname} should be a human readable name for your
>>> configuration.
>>>
>>> 3. Now you can edit the file using a text editor:
>>>
>>> * remove the "service.bundleLocation", "service.factoryPid" and
>>> "service.pid" keys. Those are set by the OSGi environment and should
>>> not be in the config
>>> * add the configuration of the minLabelScore property
>>> 'enhancer.engines.linking.minLabelScore="0.55"'
>>> * you can change/add other configuration parameters as described in
>>> [1]
>>>
>>> 4. Finally you need to (1) delete the current configuration of your
>>> engine via the "config" tab of the Felix Webconsole and (2) copy your
>>> configuration file to the 'stanbol/fileinstall' folder.
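
To make the 0.667 figure in the reply concrete, here is a minimal sketch of the label score arithmetic. It is a simplification and not Stanbol's implementation: the function name `label_score`, the whitespace tokenization and the lower-casing are assumptions made for illustration, and the real engine additionally reduces the score for inexact token matches and for tokens appearing in a different order.

```python
def label_score(label: str, text: str) -> float:
    """Fraction of label tokens that also occur in the text.

    Hypothetical helper: tokens are split on whitespace and compared
    case-insensitively, which is an assumption, not Stanbol's tokenizer.
    """
    label_tokens = label.lower().split()
    text_tokens = set(text.lower().split())
    if not label_tokens:
        return 0.0
    matched = sum(1 for token in label_tokens if token in text_tokens)
    return matched / len(label_tokens)


# The example from this thread: 2 of the 3 label tokens occur in the text.
score = label_score("le plombier moustachu", "cette plombier moustachu")

DEFAULT_MIN_LABEL_SCORE = 0.75  # default value quoted in the reply
LOWERED_MIN_LABEL_SCORE = 0.55  # value suggested in the reply

linked_with_default = score >= DEFAULT_MIN_LABEL_SCORE
linked_with_lowered = score >= LOWERED_MIN_LABEL_SCORE

print(round(score, 3), linked_with_default, linked_with_lowered)
```

With the default threshold the entity is rejected on the label score alone, while the lowered value of 0.55 lets it pass this check. The other two minimum scores named in the quoted documentation are not modelled here.
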
>>>
>>> best
>>> Rupert
>>>
>>> [1] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entity-linker-configuration
>>> [2] http://stanbol.staging.apache.org/docs/trunk/production-mode/partial-updates.html
>>>
>>> On Thu, Apr 18, 2013 at 5:22 PM, Rupert Westenthaler
>>> <[email protected]> wrote:
>>> > On Thu, Apr 18, 2013 at 4:04 PM, Joseph M'Bimbi-Bene
>>> > <[email protected]> wrote:
>>> >> Thank you for your answer.
>>> >>
>>> >> But I misunderstood your indication. I mean, I thought I could specify a
>>> >> specific word to be linkable or matchable.
>>> >>
>>> >> I have another question: how can I see the score when there is no match?
>>> >>
>>> >
>>> > If there is no match then there is no score.
>>> >
>>> > [..log..]
>>> >> ?
>>> >
>>> > OK I can see your point. This is indeed a strange behavior. To be
>>> > honest I have not tested much in settings without POS tags. So this
>>> > might as well be a bug.
>>> >
>>> > I will try to reproduce this to have a detailed look at what is going on.
>>> >
>>> > best
>>> > Rupert
>>> >
>>> >>
>>> >> I tried nlp2rdf, and in the resulting RDF I cannot see it (maybe I missed
>>> >> it though, there is so much information displayed, I am kinda lost)
>>> >>
>>> >>
>>> >> 2013/4/18 Rupert Westenthaler <[email protected]>
>>> >>
>>> >>> On Thu, Apr 18, 2013 at 3:16 PM, Joseph M'Bimbi-Bene
>>> >>> <[email protected]> wrote:
>>> >>> > I don't see the option, can you give me the procedure or a more precise
>>> >>> > indication please?
>>> >>> >
>>> >>>
>>> >>> If you do not want to use POS tagging, then the options are limited:
>>> >>>
>>> >>> * uc {NONE/MATCH/LINK}::string - the Upper Case Token Mode allows to
>>> >>> configure how upper case words are treated.
>>> >>> There are three possible
>>> >>> modes: (1) NONE: defines that they are not specially treated; (2)
>>> >>> MATCH: defines that they are considered as matchable tokens
>>> >>> (independent of the POS tag or the token length); (3) LINK: defines
>>> >>> that they are in any case linked with the vocabulary. The default is
>>> >>> "LINK" - as upper case words often represent named entities - with the
>>> >>> exception of German ('de') where the mode is set to MATCH - as all
>>> >>> Nouns in German are upper case.
>>> >>>
>>> >>> e.g.
>>> >>>
>>> >>> org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["fr;uc\=MATCH"]
>>> >>> enhancer.engines.linking.minSearchTokenLength=3
>>> >>>
>>> >>> This would MATCH all upper case words and words with three or more chars.
>>> >>>
>>> >>> However, if your vocabulary does contain Entities that would appear in
>>> >>> texts as specific POS (e.g. Nouns) I would really recommend you to
>>> >>> give POS tagging a try.
>>> >>>
>>> >>> If you like you can try to process some of your texts using the
>>> >>>
>>> >>> * DBpedia proper noun linking on
>>> >>> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-proper-noun
>>> >>> * Freebase proper noun linking currently running in an early test
>>> >>> version on
>>> >>> http://dev.iks-project.eu:8083/enhancer/chain/freebase-proper-noun
>>> >>>
>>> >>> both chains do use the talismane integration [1] for NLP processing
>>> >>>
>>> >>> best
>>> >>> Rupert
>>> >>>
>>> >>> > best
>>> >>> > Rupert
>>> >>> >
>>> >>> >
>>> >>> > [1] https://github.com/westei/stanbol-talismane
>>> >>> > [2] http://dev.iks-project.eu:8081/enhancer/chain/NIF-demo
>>> >>> > [3] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#linking-process
>>> >>> >
>>> >>> > --
>>> >>> > | Rupert Westenthaler [email protected]
>>> >>> > | Bodenlehenstraße 11 ++43-699-11108907
>>> >>> > | A-5500 Bischofshofen
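
For reference, the effect of the `uc\=MATCH` and `minSearchTokenLength=3` settings quoted above ("this would MATCH all upper case words and words with three or more chars") can be sketched as below. This is only an illustration under stated assumptions: `is_matchable` is a hypothetical helper and not a Stanbol API, the LINK mode is treated the same as MATCH for simplicity, and the token "Mario" is an invented example. It only models which text tokens are used for searching; it does not model how label tokens count towards the label score.

```python
def is_matchable(token: str,
                 min_search_token_length: int = 3,
                 upper_case_mode: str = "MATCH") -> bool:
    """Decide whether a text token would be used for matching (no POS tags).

    Hypothetical simplification of the quoted description:
    - in MATCH (and, as an assumption, LINK) mode every upper case word matches,
      independent of its length;
    - otherwise a token needs at least `min_search_token_length` characters.
    """
    if upper_case_mode in ("MATCH", "LINK") and token[:1].isupper():
        return True
    return len(token) >= min_search_token_length


# "Mario" is an invented token added to show the upper case rule.
tokens = "le plombier moustachu Mario".split()
matchable = [token for token in tokens if is_matchable(token)]
print(matchable)  # ['plombier', 'moustachu', 'Mario']
```
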
