Hi Stefan,

STANBOL-660 [1] is now resolved (for both 0.12.1 and 1.0.0), so you can
now explicitly pass the language of the content you send by using the
Content-Language header.
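
A minimal sketch of such a request in Python, in case it helps. The
host, port, and the chain name "my-chain" are assumptions for a local
test setup, not values from your installation, and the Accept type is
just one possible choice:

```python
import urllib.request

# Assumptions: a local Stanbol launcher on port 8080 and a chain
# named "my-chain". Adjust both to your own setup.
STANBOL_CHAIN_URL = "http://localhost:8080/enhancer/chain/my-chain"


def build_enhancer_request(text, language="en", url=STANBOL_CHAIN_URL):
    """Build a POST request that states the language of the text explicitly."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            # Tells the Enhancer which language the text is in,
            # instead of relying on automatic language detection.
            "Content-Language": language,
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


request = build_enhancer_request("We spent a week in Costa de Xurius.")

# Sending it requires a running Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The request is only built here, not sent, so you can check the headers
before pointing it at a real instance.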

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-660

On Mon, May 19, 2014 at 8:27 AM, Rupert Westenthaler
<rupert.westentha...@gmail.com> wrote:
> Hi Stefan
>
> On Sat, May 17, 2014 at 3:49 PM, Stefan Bunk
> <stefan.b...@student.hpi.uni-potsdam.de> wrote:
>> The problem is that the texts I send to the chain are quite short, usually
>> only one sentence, and they often contain some obviously non-English name
>> like "Costa de Xurius". This confuses the language detection, which then
>> outputs Spanish rather than English in this example. Afterwards,
>> the geonames-ner engine does not even bother to run because the text is not
>> in a language it was trained for.
>>
>> So, what's the right way to do it now? Can I somehow force the chain to
>> emit English as the language of the text? Removing the langdetect engine
>> does not work, as it is needed by the custom ner model engine.
>>
>
> This reminds me of STANBOL-660, which is about exactly this problem.
> I have not been affected by it for some time, so I totally forgot about it.
> I have scheduled this issue to be fixed for 0.12.1 and 1.0.0 and will try
> to implement it later today.
>
> Once this is implemented you can pass the language via the
> Content-Language header and remove the LanguageDetection engine from
> your chain.
>
>> ----
>> Furthermore, I am not satisfied with the geonames.org entity linking.
>> Even when the text is correctly classified as English and the location
>> entity is found, the geonames linking can't link many entities.
>> Example:
>> The text snippet is "University of Buenos Aires". This is the exact name of
>> the entity on geonames.org. Still, I had to lower the confidence score to
>> 20% to have the geonames engine find the link (confidence: 24%). Many
>> entities are not even found, even when I use the exact name as on
>> geonames.org and it is correctly identified as a location.
>>
>> Where can I look to improve the linking performance?
>>
>
> I think STANBOL-1303 is the reason for the unexpected confidence values.
>
> You can try using the Entityhub Indexing Tool for Geonames
> (entityhub/indexing/geonames) to generate your own local index for
> Geonames. After installing this index to the Stanbol Entityhub you can
> use the Named Entity Linking Engine [1] for entity linking. This
> would also have the advantage that you do not depend on an external
> service for linking.
>
> You can use one of the Geonames indexes available at [2] for testing.
> Those are based on a geonames.org dump that is about 1 year old.
>
> best
> Rupert
>
>
>
> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/namedentitytaggingengine
> [2] http://dev.iks-project.eu/downloads/stanbol-indices/geonames/
>
> --
> | Rupert Westenthaler             rupert.westentha...@gmail.com
> | Bodenlehenstraße 11                              ++43-699-11108907
> | A-5500 Bischofshofen
> | REDLINK.CO 
> ..........................................................................
> | http://redlink.co/



