Hi Joseph,

What engines do you use for NLP processing of French texts? OpenNLP has no models for French, so if you configure only those engines you will get Tokens, but no detected Sentences, POS tags or NER annotations. In this case the EntityhubLinkingEngine falls back to linking all Tokens of the text that have a length >= "Min Search Token Length" (default = 3). So, assuming that your EnhancementChain is configured as described, "plombier" and "moustachu" should be linked with the vocabulary.
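
To illustrate the fallback rule, here is a minimal sketch in Python. It is not Stanbol code: the function name, the constant and the whitespace tokenization are assumptions made only for this example. It only shows the idea that, without POS tags, token length decides which words are looked up.

```python
# Hypothetical illustration of the length-based fallback; not the Stanbol implementation.
MIN_SEARCH_TOKEN_LENGTH = 3  # the default value mentioned above


def fallback_linkable_tokens(text, min_length=MIN_SEARCH_TOKEN_LENGTH):
    """Return the tokens that would be looked up when no POS tags are available."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return [token for token in tokens if len(token) >= min_length]


print(fallback_linkable_tokens("le plombier moustachu"))
# prints ['plombier', 'moustachu']
```

With the default of 3, "le" is skipped while "plombier" and "moustachu" are kept, which is why both words should end up being searched in the vocabulary.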
BTW: If you are interested in processing French texts with Stanbol you should consider using the Stanbol Talismane integration [1].

Problems can also arise with very short texts, (1) because the language might not be detected correctly and (2) because POS and NER annotations do not work very well in such scenarios. So please check what language was detected for your input. If it was classified as one of the supported languages (e.g. pt) you might also get unexpected results.

Regarding the matching of skos:altLabel: The EntityhubLinkingEngine links only to a single field. By default this is set to rdfs:label. If you want to match against both skos:prefLabel and skos:altLabel, then there are two possibilities:

(1) copy the values of both skos:prefLabel and skos:altLabel to rdfs:label and configure rdfs:label for the engine
(2) configure two instances of the EntityhubLinkingEngine: one for skos:prefLabel and the other for skos:altLabel.

If you want to know what happens ...

(1) you can add a logger configuration that sets the log level for "org.apache.stanbol.enhancer.engines.entitylinking" to DEBUG. For that go to the "Configuration" tab of the Felix Web Console and add a new "Apache Sling Logging Logger Configuration". At DEBUG level detailed information about the linking process is printed to the log.

(2) if you want detailed information about the NLP processing results to be added to the enhancement results you can add the nlp2rdf enhancement engine to your Stanbol instance and your enhancement chain. For that you first need to install the bundle of this engine in the Stanbol environment (e.g. by using the Bundles tab of the Felix Web Console) and after that add the engine to your chain configuration. This engine writes detailed information about the NLP processing results. You can test it on [2].

On Wed, Apr 17, 2013 at 4:16 PM, Joseph M'Bimbi-Bene <[email protected]> wrote:
> Also, why is "le plombier moustachu" recognized ? why is there a difference
> ?
No idea. Maybe the detected language changes when a word is added.

>
> Another related question is: what is the pos type of a token when i
> deactivate the POStagging ?

Then there are simply no POS annotations and the length of the words is used to decide whether they are linked or not. Note that, regardless of that, upper case words always trigger searches in the linked vocabulary.

> Are they all proper noun ? what happens ? how can i parameter that ?

The EntityLinking engine distinguishes

* Linkable Tokens: These are words that are linked with the vocabulary. This means that the engine will issue queries in the controlled vocabulary for those tokens.
* Matchable Tokens: Matchable tokens are used to refine queries. For the matching of entity labels with the text those words are treated in the same way as linkable words. So the main difference is that matchable words alone will not cause the engine to query for Entities in the controlled vocabulary.
* Other Tokens: All other tokens in the text are not used for searches in the configured vocabulary. However, during the matching of labels with the text they are considered, as they might also be present in labels of entities.

The rules for classifying words as Linkable and Matchable can be controlled by the configuration of the EntityLinkingEngine. You can find details about that in the documentation at [3].

best
Rupert

[1] https://github.com/westei/stanbol-talismane
[2] http://dev.iks-project.eu:8081/enhancer/chain/NIF-demo
[3] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#linking-process

--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
