[ 
https://issues.apache.org/jira/browse/OPENNLP-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181275#comment-14181275
 ] 

Mark Giaconia commented on OPENNLP-579:
---------------------------------------

Joern, the doc text is passed in to give simple access to the full text. In 
the GeoEntityLinker, for example, I use the text to discover "country context". 
If the text were not passed in, I would have to reconstruct it from the tokens, 
which would be expensive when processing millions of documents. So it's a 
matter of convenience.
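To illustrate the convenience argument, here is a minimal sketch. The class 
and method names are hypothetical, and the substring check is only a stand-in; 
it is not the GeoEntityLinker's actual country-context logic. With the full 
text available the check is a single scan, while the alternative first has to 
rebuild a string from the tokens for every document.

```java
public class DocTextSketch {

    // With the full text in hand, a context check is a single scan of one string.
    // (Hypothetical check; the real country-context discovery may work differently.)
    static boolean mentionsCountry(String docText, String country) {
        return docText.toLowerCase().contains(country.toLowerCase());
    }

    // Without the text, it has to be rebuilt from the tokens first. This costs an
    // extra pass and an extra allocation per document, and the original whitespace
    // is lost because tokens are simply rejoined with single spaces.
    static String rebuildFromTokens(String[][] tokensBySentence) {
        StringBuilder sb = new StringBuilder();
        for (String[] sentence : tokensBySentence) {
            for (String token : sentence) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(token);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String docText = "Flights to Paris, France resumed.";
        String[][] tokens = {{"Flights", "to", "Paris", ",", "France", "resumed", "."}};

        System.out.println(mentionsCountry(docText, "france"));
        System.out.println(mentionsCountry(rebuildFromTokens(tokens), "france"));
    }
}
```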
I think we should remove this method signature:
List<T> find(String doctext, Span sentences[], Span tokens[], Span nameSpans[]);
There is no way to really know which sentence the tokens and spans belong to, 
which is why I created the other one that has a sentence index param.
It seems that with your suggestion of using 
List<T> find(String doctext, Span[][] tokensBySentence, Span[][] namesBySentence);
we could eliminate these:

List<T> find(String doctext, Span sentences[], Span tokens[], Span nameSpans[]);
List<T> find(String doctext, Span sentences[], Span tokens[], Span nameSpans[], int sentenceIndex);
List<T> find(String doctext, Span sentences[], String tokens[], Span nameSpans[]);

so we would be left with

List<T> find(String doctext, Span[] sentences, String[][] tokensBySentence, Span[][] namesBySentence);
for the case when you need to get the sentences out of the text downstream for 
some reason (hence the sentences Span[]),
and
List<T> find(String doctext, Span[][] tokensBySentence, Span[][] namesBySentence);
for the case when you don't need to get sentence text out downstream (from 
"deep" within an EntityLinkerImpl).

thoughts?
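
For discussion, here is a rough sketch of how the two remaining signatures 
could look side by side. Everything in it is hypothetical: the Span class is a 
minimal stand-in rather than opennlp.tools.util.Span, the Linker interface and 
SurfaceTextLinker are made-up names, and it assumes that token spans are 
character offsets into doctext while name spans are token-index ranges within 
their own sentence (end exclusive). The toy implementation only returns the 
surface text of each name, which is enough to show that the per-sentence 
arrays remove the ambiguity about which sentence a span belongs to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkerSignatureSketch {

    /** Hypothetical stand-in for a span with an exclusive end; not the OpenNLP Span. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** The two signatures that would remain under the proposal (interface name is made up). */
    interface Linker<T> {
        List<T> find(String doctext, Span[] sentences, String[][] tokensBySentence,
                     Span[][] namesBySentence);

        List<T> find(String doctext, Span[][] tokensBySentence, Span[][] namesBySentence);
    }

    /** Toy linker that returns the surface text of every name, sentence by sentence. */
    static final class SurfaceTextLinker implements Linker<String> {

        // Tokens arrive as strings, so a name is rebuilt by joining its tokens.
        @Override
        public List<String> find(String doctext, Span[] sentences, String[][] tokensBySentence,
                                 Span[][] namesBySentence) {
            List<String> out = new ArrayList<>();
            for (int s = 0; s < namesBySentence.length; s++) {
                for (Span name : namesBySentence[s]) {
                    String[] nameTokens =
                        Arrays.copyOfRange(tokensBySentence[s], name.start, name.end);
                    out.add(String.join(" ", nameTokens));
                }
            }
            return out;
        }

        // Tokens arrive as character-offset spans, so a name is cut out of doctext directly.
        @Override
        public List<String> find(String doctext, Span[][] tokensBySentence,
                                 Span[][] namesBySentence) {
            List<String> out = new ArrayList<>();
            for (int s = 0; s < namesBySentence.length; s++) {
                Span[] tokens = tokensBySentence[s];
                for (Span name : namesBySentence[s]) {
                    int from = tokens[name.start].start;
                    int to = tokens[name.end - 1].end;
                    out.add(doctext.substring(from, to));
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        String doctext = "Paris is nice. I like New York.";
        Span[][] tokens = {
            {new Span(0, 5), new Span(6, 8), new Span(9, 13), new Span(13, 14)},
            {new Span(15, 16), new Span(17, 21), new Span(22, 25), new Span(26, 30),
             new Span(30, 31)}
        };
        Span[][] names = {{new Span(0, 1)}, {new Span(2, 4)}};

        System.out.println(new SurfaceTextLinker().find(doctext, tokens, names));
    }
}
```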

> Framework to dynamically link N-best matches from external data to named 
> entities by type (EntityLinker framework)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-579
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-579
>             Project: OpenNLP
>          Issue Type: Wish
>          Components: Entity Linker
>    Affects Versions: 1.6.0
>         Environment: Any
>            Reporter: Mark Giaconia
>            Assignee: Joern Kottmann
>            Priority: Minor
>              Labels: features
>             Fix For: 1.6.0
>
>         Attachments: entitylinker.properties, 
> opennlp.geoentitylinker.countrycontext.txt
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> A framework for integrating/linking external data to named entities. For 
> instance, geocoding or georeferencing location entities to geonames gazetteers 
> can be implemented as an EntityLinker. The ticket was initially created to 
> specifically solve the georeferencing/geolocating/geotagging problem, but the 
> framework should allow linkage of any external data to any entity type. 
> Commercial applications that do this are expensive, and there are many free 
> gazetteers one could use to create solutions with OpenNLP. 
> UPDATE: The current implementation of the GeoEntityLinker uses Lucene to 
> store the gazetteers, and provides utils for indexing them. The impl returns 
> lat, long (and other gaz fields) for toponyms extracted with NER.
> All extracted toponyms are scored in four ways: fuzzy string matching, 
> binning by location, context modeling, and country-mention proximity. These 
> scores provide a good basis for deciding what's worth keeping from the gaz.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
