Hello,

Thank you for your fast answer.

I use langdetect + opennlp-token + myEngine. I know French is not supported, and I use the default tokenizer on purpose. I'm using a custom vocabulary, some of my entity labels span several words, and I also want to detect scribal abbreviations (thank you, Wikipedia), so I don't want any chunking or phrase segmentation. I also think POS tagging is just going to get in my way, since I have no idea which POS tags will be assigned to some of the labels.

2013/4/18 Rupert Westenthaler <[email protected]>

> Hi Joseph,
>
> What engines do you use for NLP processing of French texts? OpenNLP
> has no models for French, so if you just configure those engines you
> will have tokens, but no detected sentences, POS tags, or NER
> annotations. In this case the EntityhubLinkingEngine falls back to
> linking all tokens of the text whose length is >= "Min Search Token
> Length" (default = 3). So assuming that your configuration of the
> EnhancementChain is as described, "plombier" and "moustachu" should
> be linked with the vocabulary.
>
> BTW: If you are interested in processing French texts with Stanbol, you
> should consider using the Stanbol Talismane integration [1].
>
> Problems can also arise with very short texts, (1) because the language
> might not be correctly detected and (2) because POS and NER annotations
> do not work very well in such scenarios. So please check what language
> was detected for your input. If it was classified as one of the
> supported ones (e.g. pt), you might also get unexpected results.
>
> Regarding the matching of skos:altLabel: The EntityhubLinkingEngine
> links only to a single field. By default this is set to rdfs:label. If
> you want to match against both skos:prefLabel and skos:altLabel, then
> there are two possibilities: (1) copy the values of both skos:prefLabel
> and skos:altLabel to rdfs:label and configure rdfs:label for the
> engine, or (2) configure two instances of the EntityhubLinkingEngine:
> one for skos:prefLabel and the other for skos:altLabel.
>
> If you want to know what happens ...
>
> (1) You can configure a logger configuration to set the logger level
> for "org.apache.stanbol.enhancer.engines.entitylinking" to DEBUG. For
> that, go to the "Configuration" tab of the Felix Web Console and add a
> new "Apache Sling Logging Logger Configuration". At DEBUG level,
> detailed information about the linking process is printed to the log.
>
> (2) If you want detailed information about the NLP processing results
> to be added to the enhancement results, you can add the nlp2rdf
> enhancement engine to your Stanbol instance and your enhancement
> chain. For that, you first need to install the bundle of this engine in
> the Stanbol environment (e.g. by using the Bundles tab of the Felix
> Web Console) and then add the engine to your chain configuration.
> This engine writes detailed information about the NLP processing
> results. You can test it on [2].
>
> On Wed, Apr 17, 2013 at 4:16 PM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > Also, why is "le plombier moustachu" recognized? Why is there a
> > difference?
>
> No idea. Maybe the detected language changes when a word is added.
>
> > Another related question is: what is the POS type of a token when I
> > deactivate the POS tagging?
>
> Then there are simply no POS annotations, and the length of the words
> is used to decide whether they are linked or not. Note that, regardless
> of that, upper case words always trigger searches in the linked
> vocabulary.
>
> > Are they all proper nouns? What happens? How can I parameterize that?
>
> The EntityLinking engine distinguishes
>
> * Linkable Tokens: These are words that are linked with the vocabulary.
>   This means that the engine will issue queries in the controlled
>   vocabulary for those tokens.
> * Matchable Tokens: Matchable tokens are used to refine queries. For
>   the matching of entity labels with the text, those words are treated
>   in the same way as linkable words. So the main difference is that
>   matchable words alone will not cause the engine to query for entities
>   in the controlled vocabulary.
> * Other Tokens: All other tokens in the text are not used for searches
>   in the configured vocabulary.
>   However, during the matching of labels with the text they are
>   considered, as they might also be present in labels of entities.
>
> The rules for classifying words as Linkable and Matchable can be
> controlled by the configuration of the EntityLinkingEngine. You can
> find details about that in the documentation at [3].
>
> best
> Rupert
>
> [1] https://github.com/westei/stanbol-talismane
> [2] http://dev.iks-project.eu:8081/enhancer/chain/NIF-demo
> [3] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#linking-process
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
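
For readers following the thread, here is a rough sketch of the fallback described in the quoted message: with no POS annotations, a token triggers a vocabulary lookup if it is long enough (the "Min Search Token Length" default of 3) or if it is an upper case word. This is not Stanbol code. The function and class names are hypothetical, and treating "upper case word" as "starts with an upper case letter" is an assumption made only for illustration.

```python
from dataclasses import dataclass

# Default value quoted in the message above.
MIN_SEARCH_TOKEN_LENGTH = 3


@dataclass
class TokenDecision:
    """Hypothetical record of whether a token would trigger a lookup."""
    token: str
    linkable: bool
    reason: str


def classify_without_pos(tokens, min_length=MIN_SEARCH_TOKEN_LENGTH):
    """Classify tokens when no POS annotations are available.

    Assumption: an "upper case word" is one whose first character is
    upper case. The real engine may apply a different test.
    """
    decisions = []
    for token in tokens:
        if token[:1].isupper():
            decisions.append(TokenDecision(token, True, "upper case"))
        elif len(token) >= min_length:
            decisions.append(TokenDecision(token, True, "length"))
        else:
            decisions.append(TokenDecision(token, False, "too short"))
    return decisions


decisions = classify_without_pos(["le", "plombier", "moustachu"])
linkable = [d.token for d in decisions if d.linkable]
print(linkable)  # ['plombier', 'moustachu']
```

Under these assumptions "le" is skipped as too short, while "plombier" and "moustachu" would each trigger a search, which matches the behaviour the quoted message says to expect.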
