Hi Stefan

On Sat, May 17, 2014 at 3:49 PM, Stefan Bunk
<stefan.b...@student.hpi.uni-potsdam.de> wrote:
> Problem is, that my texts send to the chain are quite short, only one
> sentence usually and they often contain some obviously non-english name
> like "Costa de Xurius". This confuses the language detection, which does
> not output english anymore but rather spanish in this example. Afterwards,
> the geonames-ner engine does not even bother to run because the text is not
> in a language it was trained for.
>
> So, what's the right way to do it now? Can I somehow force the chain to
> emit english as the language of the text? Removing the langdetect engine
> does not work, as it is needed by the custom ner model engine.
>
This reminds me of STANBOL-660, which is about exactly this problem. I
have not been affected by it for some time, so I totally forgot about it.

I have scheduled this issue to be fixed for 0.12.1 and 1.0.0 and will try
to implement it later today. Once it is implemented, you can pass the
language via the Content-Language header and remove the LanguageDetection
engine from your chain.

> ----
> Furthermore, I am not satisfied with the geonames.org entity linking.
> Even when the text is correctly classified as english and the location
> entity is found, the geonames linking can't link many entities.
> Example:
> The text snippet is "University of Buenos Aires". This is the exact name of
> the entity on geonames.org. Still, I had to lower the confidence score to
> 20% to have the geonames engine find the link (confidence: 24%). Many
> entities are not even found, even when I use the exact name as on
> geonames.org and it is correctly identified as a location.
>
> Where can I look into to increase the linking performance?
>

I think STANBOL-1303 is the reason for the unexpected confidence values.

You can try using the Entityhub Indexing Tool for Geonames
(entityhub/indexing/geonames) to generate your own local index for
Geonames. After installing this index in the Stanbol Entityhub you can
use the Named Entity Linking Engine [1] for entity linking. This would
also have the advantage that you do not depend on an external service
for linking.

You can use one of the Geonames indexes available at [2] for testing.
Those are based on a geonames.org dump that is about 1 year old.

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/namedentitytaggingengine
[2] http://dev.iks-project.eu/downloads/stanbol-indices/geonames/

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/
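
For illustration, a request that passes the language via the
Content-Language header might look roughly like the Python sketch below.
It only builds the request and does not send it. The base URL, the chain
name "geonames-chain", the Accept type, and the helper name
build_enhancer_request are assumptions made for this example, and the
header is only expected to be honoured once STANBOL-660 is implemented.

```python
import urllib.request


def build_enhancer_request(text, language="en",
                           base_url="http://localhost:8080",
                           chain="geonames-chain"):
    """Build (but do not send) a POST request to a Stanbol enhancement chain.

    Assumptions: base_url and chain are placeholders for a local setup;
    the Content-Language header is the mechanism proposed in STANBOL-660.
    """
    url = f"{base_url}/enhancer/chain/{chain}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            # Declares the language up front so no language detection
            # engine is needed in the chain.
            "Content-Language": language,
            # Assumed response format; adjust to whatever you consume.
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


req = build_enhancer_request("University of Buenos Aires")
```

Sending it would then be a matter of calling urllib.request.urlopen(req)
against a running Stanbol instance whose chain no longer contains the
LanguageDetection engine.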