Hi Rupert, hi all,

thanks to your hints I was able to track down the problem. First, I
checked the engine name and the file location and both were correct (yes,
the name in my original post was not the one I actually used, sorry for
the confusion). The file was found correctly. Still, it wasn't working.

What got me on the right track was:


> 15.05.2014 10:38:28.739 *INFO* [DataFileTrackingDaemon]
> org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine
> register custom NameFinderModel from resource: geonames-ner.bin for
> language: en to NamedModelFileListener (name:opennlp-ner)

in the logs, and the fact that the geonames-ner engine always ran for only
1 ms (which is really fast, given the 5 MB model it has to work through).

The problem is that the texts I send to the chain are quite short, usually
only one sentence, and they often contain an obviously non-English name
like "Costa de Xurius". This confuses the language detection, which then
no longer outputs English but rather Spanish in this example. As a result,
the geonames-ner engine does not even run, because the text is not in the
language it was trained for.

So, what's the right way to handle this? Can I somehow force the chain to
emit English as the language of the text? Removing the langdetect engine
does not work, since the custom NER model engine depends on it.
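The only workaround I can think of on my side is to override the detected
language for very short inputs before the NER step. A minimal sketch of that
idea (the helper and the threshold are hypothetical, not a Stanbol API):

```python
def effective_language(text: str, detected: str,
                       min_chars: int = 100, default: str = "en") -> str:
    """Fall back to a default language for short texts, where n-gram
    language detection is unreliable (client-side guard, NOT a Stanbol API)."""
    return default if len(text.strip()) < min_chars else detected

# A one-sentence snippet with a foreign place name gets forced to English:
print(effective_language("Costa de Xurius is beautiful.", detected="es"))
# prints "en"
```

That only helps if I can intercept the language between detection and NER,
which is exactly what I don't know how to do inside the chain.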

----
Furthermore, I am not satisfied with the geonames.org entity linking.
Even when the text is correctly classified as English and the location
entity is found, the geonames linking fails to link many entities.
Example:
The text snippet is "University of Buenos Aires", which is the exact name
of the entity on geonames.org. Still, I had to lower the confidence
threshold to 20% before the geonames engine found the link (confidence:
24%). Many entities are not found at all, even when I use the exact name
as on geonames.org and it is correctly identified as a location.
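To illustrate what I suspect is happening: if the linking confidence is
derived from token overlap between the mention and the stored labels (a
guess on my part, not Stanbol's actual formula), then matching against an
alternate label, e.g. a localized one, would explain scores well below
100% even for seemingly exact names:

```python
def overlap_confidence(mention: str, label: str) -> float:
    """Toy score: Jaccard overlap of lower-cased tokens.
    Only an illustration, NOT Stanbol's actual confidence formula."""
    m, l = set(mention.lower().split()), set(label.lower().split())
    return len(m & l) / len(m | l) if m | l else 0.0

# An exact label match scores 1.0:
print(overlap_confidence("University of Buenos Aires",
                         "University of Buenos Aires"))  # prints 1.0

# Matching against a (hypothetical) localized label scores much lower:
print(overlap_confidence("University of Buenos Aires",
                         "Universidad de Buenos Aires"))  # ~0.33
```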

Where can I look to improve the linking performance?

Best,
Stefan
