[ https://issues.apache.org/jira/browse/OPENNLP-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134232#comment-15134232 ]
Tommaso Teofili edited comment on OPENNLP-659 at 2/5/16 3:50 PM:
-----------------------------------------------------------------
New patch: removed the dummy models from the previous patches and kept only
{{NgramLanguageModel}} (which uses Laplace smoothing by default and Stupid Backoff
when the number of n-grams is 1M or more), and added the
{{LanguageModel#predictNextTokens}} API.
[~mwunderlich] it'd be good to know whether the API below satisfies your needs:
{code}
import opennlp.tools.util.StringList;

public interface LanguageModel {

  /**
   * Calculate the probability of a series of tokens (e.g. a sentence), given a vocabulary
   *
   * @param tokens the text tokens to calculate the probability for
   * @return the probability of the given text tokens in the vocabulary
   */
  double calculateProbability(StringList tokens);

  /**
   * Predict the most probable output sequence of tokens, given an input sequence of tokens
   *
   * @param tokens a sequence of tokens
   * @return the most probable subsequent token sequence
   */
  StringList predictNextTokens(StringList tokens);

}
{code}
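For anyone following along who hasn't looked at the patch, here is a minimal sketch of the two smoothing schemes named above, reduced to bigrams. It is not the {{NgramLanguageModel}} code from the patch: the class name {{BigramSmoothingSketch}}, its methods, and the use of plain {{List<String>}} instead of {{StringList}} are all assumptions made to keep the example self-contained. The 0.4 backoff factor is the value commonly used with Stupid Backoff.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the NgramLanguageModel from the patch.
class BigramSmoothingSketch {
    // Backoff factor commonly used with Stupid Backoff.
    private static final double BACKOFF_FACTOR = 0.4;

    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<List<String>, Integer> bigramCounts = new HashMap<>();
    private int totalTokens = 0;

    // Count the unigrams and bigrams of one tokenized sentence.
    void add(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            unigramCounts.merge(tokens.get(i), 1, Integer::sum);
            totalTokens++;
            if (i > 0) {
                bigramCounts.merge(Arrays.asList(tokens.get(i - 1), tokens.get(i)), 1, Integer::sum);
            }
        }
    }

    // Laplace (add-one) estimate: (c(prev, word) + 1) / (c(prev) + V),
    // where V is the vocabulary size. Assumes add() was called at least once.
    double laplace(String prev, String word) {
        int pair = bigramCounts.getOrDefault(Arrays.asList(prev, word), 0);
        int context = unigramCounts.getOrDefault(prev, 0);
        return (pair + 1.0) / (context + unigramCounts.size());
    }

    // Stupid Backoff score: the relative bigram frequency if the bigram was seen,
    // otherwise the discounted relative unigram frequency. These scores are not
    // normalized probabilities.
    double stupidBackoff(String prev, String word) {
        int pair = bigramCounts.getOrDefault(Arrays.asList(prev, word), 0);
        if (pair > 0) {
            return (double) pair / unigramCounts.get(prev);
        }
        if (totalTokens == 0) {
            return 0.0;
        }
        return BACKOFF_FACTOR * unigramCounts.getOrDefault(word, 0) / totalTokens;
    }

    // Product of the Laplace bigram estimates over a token sequence.
    // The first token itself is not scored in this sketch.
    double sequenceProbability(List<String> tokens) {
        double p = 1.0;
        for (int i = 1; i < tokens.size(); i++) {
            p *= laplace(tokens.get(i - 1), tokens.get(i));
        }
        return p;
    }
}
```

Trained on the single sentence {{a b a c}}, the sketch gives {{laplace("a", "b")}} = (1 + 1) / (2 + 3) = 0.4, while {{stupidBackoff("b", "c")}} backs off to 0.4 * 1/4 = 0.1 because the bigram {{b c}} was never seen.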
> Language models
> ---------------
>
> Key: OPENNLP-659
> URL: https://issues.apache.org/jira/browse/OPENNLP-659
> Project: OpenNLP
> Issue Type: New Feature
> Affects Versions: tools-1.5.3
> Environment: all
> Reporter: Martin Wunderlich
> Assignee: Tommaso Teofili
> Priority: Minor
> Labels: features, language, model
> Attachments: OPENNLP-659.0.patch, OPENNLP-659.1.patch,
> OPENNLP-659.2.patch
>
> Original Estimate: 7m
> Remaining Estimate: 7m
>
> This feature request is for inclusion of n-gram language models in OpenNLP.
> The language models could either be preconstructed from existing corpora for
> various languages or they could be built by the user based on sample texts.
> There should be unigram, bigram and trigram LMs at least, with absolute and
> relative frequencies for each n-gram.
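To make the request above concrete, a rough sketch of what "absolute and relative frequencies for each n-gram" could look like is given here. The class {{NgramFrequencySketch}} and its method names are hypothetical and are not part of OpenNLP; relative frequency is assumed to mean the count of an n-gram divided by the total number of n-grams of the same size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not an OpenNLP class.
class NgramFrequencySketch {

    // Absolute frequency of every n-gram of size n in the token list.
    static Map<List<String>, Integer> absolute(List<String> tokens, int n) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            // Copy the window so the map key does not depend on the input list.
            counts.merge(new ArrayList<>(tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    // Relative frequency: each count divided by the total number of n-grams counted.
    static Map<List<String>, Double> relative(Map<List<String>, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<List<String>, Double> result = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), (double) e.getValue() / total);
        }
        return result;
    }
}
```

For the tokens {{to be or not to be}}, calling {{absolute(tokens, 2)}} counts the bigram {{to be}} twice out of five bigrams, so its relative frequency is 0.4. Calling the same methods with n = 1 or n = 3 gives the unigram and trigram tables.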
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)