[
https://issues.apache.org/jira/browse/OPENNLP-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181275#comment-14181275
]
Mark Giaconia commented on OPENNLP-579:
---------------------------------------
Joern, the doc text is passed in so that there is simple access to the full
text. For example, in the GeoEntityLinker I use the text to discover "country
context". If the text were not passed in, I would have to reconstruct it from
the tokens, which would be expensive when processing millions of documents. So
it's a matter of convenience.
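To illustrate the point about reconstruction, here is a minimal sketch. It assumes a simplified, hypothetical Span class holding character offsets (a stand-in, not opennlp.tools.util.Span), and the helper names coveredText and rebuildFromTokens are made up for this example. With the doc text available, getting text for a span is a single substring call; without it, the text has to be reassembled from tokens and the original spacing is lost.

```java
import java.util.StringJoiner;

public final class DocTextSketch {

    /** Hypothetical stand-in for a character-offset span: [start, end). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** With the doc text passed in: slice the covered text directly. */
    static String coveredText(String doctext, Span span) {
        return doctext.substring(span.start, span.end);
    }

    /**
     * Without the doc text: join tokens with single spaces.
     * The original whitespace around punctuation is not recovered.
     */
    static String rebuildFromTokens(String[] tokens) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String token : tokens) {
            joiner.add(token);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String doctext = "Paris, France is nice.";
        String[] tokens = {"Paris", ",", "France", "is", "nice", "."};

        // Exact original text for the whole document span.
        System.out.println(coveredText(doctext, new Span(0, doctext.length())));
        // Approximation rebuilt from tokens; spacing differs from the original.
        System.out.println(rebuildFromTokens(tokens));
    }
}
```

The rebuilt string is only an approximation of the source text, which is another reason to pass the original in rather than reassemble it per document.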
I think we should remove this method signature:
List<T> find(String doctext, Span sentences[], Span tokens[], Span nameSpans[]);
There is no way to really know which sentence the tokens and spans are for,
which is why I created the other one that has a sentence index param.
It seems that with your suggestion of using
List<T> find(String doctext, Span[][] tokensBySentence, Span[][] namesBySentence);
we could eliminate these:
List<T> find(String doctext, Span sentences[], Span tokens[], Span nameSpans[]);
List<T> find(String doctext, Span sentences[], Span tokens[], Span nameSpans[], int sentenceIndex);
List<T> find(String doctext, Span sentences[], String tokens[], Span nameSpans[]);
so we would be left with
List<T> find(String doctext, Span[] sentences, String[][] tokensBySentence, Span[][] namesBySentence);
for the case when you need to get the sentences out of the text downstream for
some reason, hence the sentences Span[],
and
List<T> find(String doctext, Span[][] tokensBySentence, Span[][] namesBySentence);
for the case when you don't need to get sentence text out downstream (from
"deep" within an EntityLinkerImpl).
Thoughts?
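A rough sketch of how the by-sentence layout could be consumed, under stated assumptions: Span here is a simplified, hypothetical stand-in (not opennlp.tools.util.Span), the Linker interface only mirrors the two remaining signatures above, and the helper nameTexts is made up for illustration. It further assumes that tokensBySentence[i] holds character offsets into doctext and that namesBySentence[i] holds token-index spans into tokensBySentence[i]; the thread does not confirm that convention.

```java
import java.util.ArrayList;
import java.util.List;

public final class BySentenceSketch {

    /** Hypothetical stand-in span: [start, end). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical interface mirroring the two signatures left after the cleanup. */
    interface Linker<T> {
        List<T> find(String doctext, Span[] sentences, String[][] tokensBySentence,
                     Span[][] namesBySentence);

        List<T> find(String doctext, Span[][] tokensBySentence, Span[][] namesBySentence);
    }

    /**
     * Resolves each name to its text. Because names are grouped by sentence,
     * the sentence a name belongs to is simply the outer array index.
     * Assumes name spans are token indices and token spans are char offsets.
     */
    static List<String> nameTexts(String doctext, Span[][] tokensBySentence,
                                  Span[][] namesBySentence) {
        List<String> result = new ArrayList<>();
        for (int s = 0; s < namesBySentence.length; s++) {
            Span[] tokens = tokensBySentence[s];
            for (Span name : namesBySentence[s]) {
                int from = tokens[name.start].start;   // first token of the name
                int to = tokens[name.end - 1].end;     // last token of the name
                result.add(doctext.substring(from, to));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String doctext = "I flew to New York. Then Paris.";
        Span[][] tokensBySentence = {
            {new Span(0, 1), new Span(2, 6), new Span(7, 9),
             new Span(10, 13), new Span(14, 18), new Span(18, 19)},
            {new Span(20, 24), new Span(25, 30), new Span(30, 31)}
        };
        Span[][] namesBySentence = {
            {new Span(3, 5)},   // tokens 3..4 of sentence 0
            {new Span(1, 2)}    // token 1 of sentence 1
        };
        System.out.println(nameTexts(doctext, tokensBySentence, namesBySentence));
    }
}
```

With flat tokens[] and nameSpans[] arrays the same lookup would need an extra sentence index, which is the ambiguity the by-sentence arrays are meant to remove.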
> Framework to dynamically link N-best matches from external data to named
> entities by type (EntityLinker framework)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-579
> URL: https://issues.apache.org/jira/browse/OPENNLP-579
> Project: OpenNLP
> Issue Type: Wish
> Components: Entity Linker
> Affects Versions: 1.6.0
> Environment: Any
> Reporter: Mark Giaconia
> Assignee: Joern Kottmann
> Priority: Minor
> Labels: features
> Fix For: 1.6.0
>
> Attachments: entitylinker.properties,
> opennlp.geoentitylinker.countrycontext.txt
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> A framework for integrating/linking external data to named entities. For
> instance, geocoding or georeferencing location entities to geonames gazetteers
> can be implemented as an EntityLinker. Initially created ticket to
> specifically solve the georeferencing/geolocating/geotagging problem, but the
> framework should allow linkage of any external data to any entity type.
> Commercial applications that do this are expensive, and there are many free
> gazetteers one could use to create solutions with OpenNLP.
> UPDATE: The current implementation of the GeoEntityLinker uses Lucene to
> store the gazetteers, and provides utils for indexing them. The impl returns
> lat, long (and other gaz fields) for toponyms extracted with NER.
> All extracted toponyms are scored in four ways: fuzzy string matching,
> binning by location, context modeling, and country-mention proximity. These
> scores provide a good means of deciding what's worth keeping from the gaz.