Dear all,

I came across a problem in my local Korean-language version of DBpedia
Spotlight and would appreciate your help in understanding it. Currently, in
the disambiguation step, Spotlight seems to find the best-matching entity for
every candidate token detected in the input text, but it does not
disambiguate between entities whose text offsets overlap. Here is a simple
example, translated into English:

"President Barack Obama"

The text is tokenized into the following candidate tokens:
President, Barack, Obama, President Barack, Barack Obama, President
Barack Obama
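
To make the shared offsets explicit, here is a minimal sketch of how such
candidate tokens could be enumerated together with their character offsets.
It is not Spotlight code: the function name candidate_spans and the max_len
parameter are hypothetical, and plain whitespace splitting stands in for the
real tokenizer.

```python
from typing import List, Tuple

# (start offset, end offset, surface form); end offset is exclusive
Span = Tuple[int, int, str]


def candidate_spans(text: str, max_len: int = 3) -> List[Span]:
    """Enumerate all word n-grams of up to max_len words, with offsets."""
    # Locate every whitespace-separated word and record its offsets.
    words = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        words.append((start, end))
        pos = end

    # Combine consecutive words into n-grams of increasing length.
    spans = []
    for i in range(len(words)):
        for j in range(i, min(i + max_len, len(words))):
            start, end = words[i][0], words[j][1]
            spans.append((start, end, text[start:end]))
    return spans


spans = candidate_spans("President Barack Obama")
for span in spans:
    print(span)
```

In this sketch "President" covers offsets 0 to 9 and "President Barack Obama"
covers 0 to 22, so the two candidates overlap.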

Spotlight then correctly detects and disambiguates the input token
"President", choosing, say, between the following DBpedia URIs:
http://dbpedia.org/page/President_of_the_United_States
http://dbpedia.org/page/President_of_Germany

It likewise resolves the token "President Barack Obama" to the DBpedia
resource:
http://dbpedia.org/page/Presidency_of_Barack_Obama

The same applies to all other tokens.

However, Spotlight fails to realize that a further disambiguation step is
needed between the entities detected for "President" and "President Barack
Obama": they cannot both be annotated in the output, since they partly cover
the same text offsets.
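
The kind of second step I have in mind could look roughly like the sketch
below. This is only an illustration under my own assumptions, not what
Spotlight actually does: the Annotation type, the resolve_overlaps function
and the scores are hypothetical, and the greedy "highest score wins" rule is
just one possible policy.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    start: int    # character offset where the surface form begins
    end: int      # character offset where it ends (exclusive)
    uri: str      # DBpedia resource chosen for this surface form
    score: float  # hypothetical disambiguation confidence


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two annotations share at least one character."""
    return a.start < b.end and b.start < a.end


def resolve_overlaps(annotations: List[Annotation]) -> List[Annotation]:
    """Greedily keep the best-scoring annotations that do not overlap."""
    kept: List[Annotation] = []
    # Visit candidates by score; break ties in favour of longer spans.
    ranked = sorted(annotations,
                    key=lambda a: (-a.score, -(a.end - a.start), a.start))
    for cand in ranked:
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda a: a.start)


# Scores are made up for illustration only.
annotations = [
    Annotation(0, 9,
               "http://dbpedia.org/page/President_of_the_United_States", 0.6),
    Annotation(0, 22,
               "http://dbpedia.org/page/Presidency_of_Barack_Obama", 0.8),
]
result = resolve_overlaps(annotations)
for ann in result:
    print(ann)
```

With these made-up scores only the annotation for "President Barack Obama"
survives, because the "President" annotation overlaps it and scores lower.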

I am fairly sure this bug is related to my changing the tokenization
approach from LingPipe to Lucene, but I cannot see where the error comes
from. Could you tell me which class in Spotlight is originally responsible
for the second disambiguation step described above?

Thanks a lot,
David Müller