Hi Stefan,

STANBOL-660 [1] is now resolved (for both 0.12.1 and 1.0.0), so you can now explicitly set the language of the submitted content by using the Content-Language header.
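
For illustration, here is a minimal sketch of such a request in Python. It only builds the request and does not send it. The base URL (a local launcher on port 8080), the chain name "my-chain" and the helper name build_enhancer_request are assumptions made for this example, so adapt them to your setup.

```python
import urllib.request


def build_enhancer_request(text, language="en",
                           base_url="http://localhost:8080", chain=None):
    """Build (but do not send) a POST request for the Stanbol Enhancer.

    base_url and chain are assumptions for this sketch; point them at
    your own Stanbol instance and enhancement chain.
    """
    path = "/enhancer" if chain is None else "/enhancer/chain/" + chain
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=text.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            # Declares the language of the text explicitly, so no
            # language detection is needed for this request.
            "Content-Language": language,
            "Accept": "application/rdf+xml",
        },
    )


# "my-chain" is a hypothetical chain name.
request = build_enhancer_request("Hiking near Costa de Xurius",
                                 language="en", chain="my-chain")

# To actually send it against a running instance:
# body = urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes it easy to check the headers before pointing it at a real server.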

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-660

On Mon, May 19, 2014 at 8:27 AM, Rupert Westenthaler
<rupert.westentha...@gmail.com> wrote:
> Hi Stefan
>
> On Sat, May 17, 2014 at 3:49 PM, Stefan Bunk
> <stefan.b...@student.hpi.uni-potsdam.de> wrote:
>> Problem is, that my texts sent to the chain are quite short, only one
>> sentence usually, and they often contain some obviously non-English name
>> like "Costa de Xurius". This confuses the language detection, which does
>> not output English anymore but rather Spanish in this example. Afterwards,
>> the geonames-ner engine does not even bother to run because the text is not
>> in a language it was trained for.
>>
>> So, what's the right way to do it now? Can I somehow force the chain to
>> emit English as the language of the text? Removing the langdetect engine
>> does not work, as it is needed by the custom ner model engine.
>>
>
> This reminds me of STANBOL-660, which is about exactly this problem.
> I was not affected by it for some time, so I totally forgot about it.
> I scheduled this issue to be fixed with 0.12.1 and 1.0.0. Will try to
> implement this later today.
>
> When this is implemented you can pass the language via the
> Content-Language header and remove the LanguageDetection engine from
> your chain.
>
>> ----
>> Furthermore, I am not satisfied with the geonames.org entity linking.
>> Even when the text is correctly classified as English and the location
>> entity is found, the geonames linking can't link many entities.
>> Example:
>> The text snippet is "University of Buenos Aires". This is the exact name of
>> the entity on geonames.org. Still, I had to lower the confidence score to
>> 20% to have the geonames engine find the link (confidence: 24%). Many
>> entities are not even found, even when I use the exact name as on
>> geonames.org and it is correctly identified as a location.
>>
>> Where can I look to improve the linking performance?
>>
>
> I think STANBOL-1303 is the reason for the unexpected confidence values.
>
> You can try using the Entityhub Indexing Tool for Geonames
> (entityhub/indexing/geonames) to generate your own local index for
> Geonames. After installing this index to the Stanbol Entityhub you can
> use the Named Entity Linking Engine [1] for entity linking. This
> would also have the advantage that you do not depend on an external
> service for linking.
>
> You can use one of the Geonames indexes available at [2] for testing.
> Those are based on a geonames.org dump that is about 1 year old.
>
> best
> Rupert
>
> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/namedentitytaggingengine
> [2] http://dev.iks-project.eu/downloads/stanbol-indices/geonames/
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
> | REDLINK.CO
> ..........................................................................
> | http://redlink.co/

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO
..........................................................................
| http://redlink.co/