We are now one iteration further. In this new version it is
possible to pass in a whole document at once, which leads
to the question of how we should handle this in OpenNLP generally.

To pass in a document the following information needs to be handed over:
- Sentences
- Tokens
- Names

And maybe the text as well, depending on whether the tokens are Spans or Strings.

If the component is stateless, all of this needs to be handed over in one method call; otherwise it could be handed over on a per-sentence basis (that's how coref does it).

In the DocumentNameFinder (never implemented, but the interface is defined) it's done
like this:
Span[][] find(String tokens[][])

In my opinion that's not a nice solution. First, it requires that the input text gets split into Strings. Second, it's hard to use the returned Spans: they are only meaningful within the context given by the returned array. Names which cross sentences are not possible.
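
To illustrate the second point, here is a minimal sketch of what a caller would have to do to turn the sentence-local spans of a Span[][] result into document-level token spans. The Span class below is a simplified stand-in written for this example (not the OpenNLP class), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceLocalSpans {

    // Simplified stand-in for a span type: half-open range [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    // names[s] holds token spans relative to sentence s. To make them
    // usable at document level, each span has to be shifted by the number
    // of tokens in all preceding sentences.
    static List<Span> toDocumentTokenSpans(String[][] tokens, Span[][] names) {
        List<Span> result = new ArrayList<>();
        int offset = 0;
        for (int s = 0; s < tokens.length; s++) {
            for (Span name : names[s]) {
                result.add(new Span(offset + name.start, offset + name.end));
            }
            offset += tokens[s].length;
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] tokens = {
            {"Peter", "lives", "in", "Berlin", "."},
            {"Berlin", "is", "big", "."}
        };
        // Made-up result in the Span[][] shape, one array per sentence.
        Span[][] names = {
            {new Span(0, 1), new Span(3, 4)},
            {new Span(0, 1)}
        };
        // prints [[0..1), [3..4), [5..6)]
        System.out.println(toDocumentTokenSpans(tokens, names));
    }
}
```

Note that Span(0, 1) appears in both sentences and means something different each time, which is the context problem described above.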

Another approach could be this:
Span[] find(String text, Span sentences[], Span tokens[])

Here the sentence and token spans are character offsets, and
the returned spans are token offsets.
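
A rough sketch of how this second signature could be used, under some assumptions: the Span class is a simplified stand-in written for this example, the interface name DocumentLevelNameFinder is hypothetical, and the toy finder just does an exact string match (it ignores the sentence spans) so the example has something to return.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentNameFinderSketch {

    // Simplified stand-in for a span type: half-open range [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // The signature proposed above; the interface name is made up.
    interface DocumentLevelNameFinder {
        Span[] find(String text, Span[] sentences, Span[] tokens);
    }

    // Toy finder logic: returns token spans whose covered text equals name.
    static Span[] findExact(String text, Span[] tokens, String name) {
        List<Span> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = i + 1; j <= tokens.length; j++) {
                String covered = text.substring(tokens[i].start, tokens[j - 1].end);
                if (covered.equals(name)) {
                    found.add(new Span(i, j));
                }
            }
        }
        return found.toArray(new Span[0]);
    }

    // Maps a returned token span back to character offsets in the text.
    static Span toCharSpan(Span name, Span[] tokens) {
        return new Span(tokens[name.start].start, tokens[name.end - 1].end);
    }

    public static void main(String[] args) {
        String text = "He moved to New York. It is big.";
        Span[] sentences = {new Span(0, 21), new Span(22, 32)};
        Span[] tokens = {
            new Span(0, 2), new Span(3, 8), new Span(9, 11), new Span(12, 15),
            new Span(16, 20), new Span(20, 21), new Span(22, 24),
            new Span(25, 27), new Span(28, 31), new Span(31, 32)
        };

        DocumentLevelNameFinder finder = (t, s, tok) -> findExact(t, tok, "New York");

        for (Span name : finder.find(text, sentences, tokens)) {
            Span chars = toCharSpan(name, tokens);
            // prints: tokens 3..5 -> "New York"
            System.out.println("tokens " + name.start + ".." + name.end
                + " -> \"" + text.substring(chars.start, chars.end) + "\"");
        }
    }
}
```

Since the returned spans index into the single document-level token array, they stay meaningful without any per-sentence context.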

It would probably be nicer to use token offsets for the sentences as well, but that's
currently incompatible with the sentence detector interface.
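
One possible workaround would be to convert the character-based sentence spans into token-based ones on the caller side. The following is only a sketch: Span is a simplified stand-in written for this example, the method name is hypothetical, and it assumes that both arrays are sorted and that every token lies completely inside one sentence.

```java
public class SentenceTokenSpans {

    // Simplified stand-in for a span type: half-open range [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Converts sentence spans given as character offsets into spans over
    // the token array (indexes of the first token and one past the last).
    static Span[] sentencesAsTokenSpans(Span[] sentences, Span[] tokens) {
        Span[] result = new Span[sentences.length];
        int t = 0;
        for (int s = 0; s < sentences.length; s++) {
            // Skip tokens that start before this sentence begins.
            while (t < tokens.length && tokens[t].start < sentences[s].start) {
                t++;
            }
            int first = t;
            // Consume all tokens that end inside this sentence.
            while (t < tokens.length && tokens[t].end <= sentences[s].end) {
                t++;
            }
            result[s] = new Span(first, t);
        }
        return result;
    }

    public static void main(String[] args) {
        // Character offsets for "He moved to New York. It is big."
        Span[] sentences = {new Span(0, 21), new Span(22, 32)};
        Span[] tokens = {
            new Span(0, 2), new Span(3, 8), new Span(9, 11), new Span(12, 15),
            new Span(16, 20), new Span(20, 21), new Span(22, 24),
            new Span(25, 27), new Span(28, 31), new Span(31, 32)
        };
        for (Span s : sentencesAsTokenSpans(sentences, tokens)) {
            // prints 0..6 and then 6..10
            System.out.println(s.start + ".." + s.end);
        }
    }
}
```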

Any opinions on how we should solve this?

Jörn

On 05/23/2013 03:04 PM, Jörn Kottmann wrote:
Hi all,

please have a look at
https://issues.apache.org/jira/browse/OPENNLP-579

It's about a contribution to link location entities to a geo name database. The component could later be extended to link other entity types to
a database or dictionary as well.

Thanks,
Jörn
