[jira] [Comment Edited] (OPENNLP-659) Language models

Tommaso Teofili (JIRA) Fri, 05 Feb 2016 07:52:59 -0800

    [ 
https://issues.apache.org/jira/browse/OPENNLP-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134232#comment-15134232
 ]


Tommaso Teofili edited comment on OPENNLP-659 at 2/5/16 3:50 PM:
-----------------------------------------------------------------

New patch: removed the dummy models from previous patches and kept only 
{{NgramLanguageModel}} (which uses Laplace smoothing by default and Stupid 
Backoff when the number of n-grams is 1M or more), and added the 
{{LanguageModel#predictNextTokens}} API.

[~mwunderlich] it'd be good to know if the below API satisfies your needs:

{code}
public interface LanguageModel {

  /**
   * Calculate the probability of a series of tokens (e.g. a sentence), given a
   * vocabulary
   *
   * @param tokens the text tokens to calculate the probability for
   * @return the probability of the given text tokens in the vocabulary
   */
  double calculateProbability(StringList tokens);

  /**
   * Predict the most probable output sequence of tokens, given an input
   * sequence of tokens
   *
   * @param tokens a sequence of tokens
   * @return the most probable subsequent token sequence
   */
  StringList predictNextTokens(StringList tokens);

}
{code}
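
For anyone following the thread, here is a rough sketch of how the two estimators mentioned above could look for bigrams. It is not the {{NgramLanguageModel}} from the patch: the class {{BigramModelSketch}}, its method names, the use of plain {{List<String>}} instead of {{StringList}}, and the single-token prediction are all assumptions made for illustration. The 0.4 backoff factor is the value commonly quoted for Stupid Backoff, and the sketch assumes at least one sentence has been added before any score is requested.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical bigram model, for illustration only. */
final class BigramModelSketch {

  // Backoff factor commonly quoted for Stupid Backoff.
  private static final double BACKOFF_FACTOR = 0.4;

  // token -> occurrences; a TreeMap keeps candidate order deterministic
  private final Map<String, Integer> unigramCounts = new TreeMap<>();
  // previous token -> (next token -> occurrences of the pair)
  private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
  // previous token -> number of bigrams that start with it
  private final Map<String, Integer> contextCounts = new HashMap<>();
  private int totalTokens = 0;

  /** Add the unigram and bigram counts of one tokenized sentence. */
  void add(List<String> tokens) {
    for (int i = 0; i < tokens.size(); i++) {
      unigramCounts.merge(tokens.get(i), 1, Integer::sum);
      totalTokens++;
      if (i > 0) {
        String previous = tokens.get(i - 1);
        bigramCounts.computeIfAbsent(previous, k -> new HashMap<>())
            .merge(tokens.get(i), 1, Integer::sum);
        contextCounts.merge(previous, 1, Integer::sum);
      }
    }
  }

  private int pairCount(String previous, String next) {
    return bigramCounts
        .getOrDefault(previous, Collections.emptyMap())
        .getOrDefault(next, 0);
  }

  /** Laplace (add-one) estimate of P(next | previous). */
  double laplaceProbability(String previous, String next) {
    int vocabularySize = unigramCounts.size();
    return (pairCount(previous, next) + 1.0)
        / (contextCounts.getOrDefault(previous, 0) + vocabularySize);
  }

  /** Stupid Backoff score; a relative score, not a normalized probability. */
  double stupidBackoffScore(String previous, String next) {
    int pair = pairCount(previous, next);
    if (pair > 0) {
      return (double) pair / unigramCounts.get(previous);
    }
    // Unseen bigram: back off to the discounted unigram relative frequency.
    return BACKOFF_FACTOR * unigramCounts.getOrDefault(next, 0) / totalTokens;
  }

  /** Probability of the sequence, conditioned on its first token. */
  double calculateProbability(List<String> tokens) {
    double probability = 1.0;
    for (int i = 1; i < tokens.size(); i++) {
      probability *= laplaceProbability(tokens.get(i - 1), tokens.get(i));
    }
    return probability;
  }

  /** Most probable single next token after the last token of the input. */
  String predictNextToken(List<String> tokens) {
    String previous = tokens.get(tokens.size() - 1);
    String best = null;
    double bestProbability = -1.0;
    for (String candidate : unigramCounts.keySet()) {
      double p = laplaceProbability(previous, candidate);
      if (p > bestProbability) {
        best = candidate;
        bestProbability = p;
      }
    }
    return best;
  }
}
```

Laplace smoothing gives every unseen bigram a small non-zero probability, whereas the Stupid Backoff score skips normalization entirely, which is presumably why it is the cheaper choice once the n-gram table gets large.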


was (Author: teofili):
new patch: removed dummy models from previous patches and kept only 
{{NgramLanguageModel}} (using Laplace smoothing by default and Stupid Backoff 
when the no. of ngrams is 1M+), added the {{LanguageModel#predictNextTokens}} 
API.

[~mwunderlich] it'd be good to know if the below API satisfies your needs:

{code:java}
public interface LanguageModel {

  /**
   * Calculate the probability of a series of tokens (e.g. a sentence), given a
   * vocabulary
   *
   * @param tokens the text tokens to calculate the probability for
   * @return the probability of the given text tokens in the vocabulary
   */
  double calculateProbability(StringList tokens);

  /**
   * Predict the most probable output sequence of tokens, given an input
   * sequence of tokens
   *
   * @param tokens a sequence of tokens
   * @return the most probable subsequent token sequence
   */
  StringList predictNextTokens(StringList tokens);
{code:java}

> Language models
> ---------------
>
>                 Key: OPENNLP-659
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-659
>             Project: OpenNLP
>          Issue Type: New Feature
>    Affects Versions: tools-1.5.3
>         Environment: all
>            Reporter: Martin Wunderlich
>            Assignee: Tommaso Teofili
>            Priority: Minor
>              Labels: features, language, model
>         Attachments: OPENNLP-659.0.patch, OPENNLP-659.1.patch, 
> OPENNLP-659.2.patch
>
>   Original Estimate: 7m
>  Remaining Estimate: 7m
>
> This feature request is for inclusion of n-gram language models in OpenNLP. 
> The language models could either be preconstructed from existing corpora for 
> various languages or they could be built by the user based on sample texts. 
> There should be unigram, bigram and trigram LMs at least, with absolute and 
> relative frequencies for each n-gram. 
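
To make the request above concrete, here is a minimal sketch of what "absolute and relative frequencies for each n-gram" could mean for n = 1, 2 and 3. The class {{NgramFrequencySketch}} and its method names are hypothetical and are not part of OpenNLP or of the attached patches; relative frequency is taken here as the n-gram's count divided by the total number of n-grams of the same order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, for illustration only. */
final class NgramFrequencySketch {

  /** Absolute frequency of every n-gram of order n in one token sequence. */
  static Map<List<String>, Integer> absoluteFrequencies(List<String> tokens, int n) {
    Map<List<String>, Integer> counts = new LinkedHashMap<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      // Copy the window so the key does not depend on the caller's list.
      counts.merge(new ArrayList<>(tokens.subList(i, i + n)), 1, Integer::sum);
    }
    return counts;
  }

  /** Relative frequency: each count divided by the total number of n-grams. */
  static Map<List<String>, Double> relativeFrequencies(Map<List<String>, Integer> counts) {
    int total = 0;
    for (int count : counts.values()) {
      total += count;
    }
    Map<List<String>, Double> result = new LinkedHashMap<>();
    for (Map.Entry<List<String>, Integer> entry : counts.entrySet()) {
      result.put(entry.getKey(), (double) entry.getValue() / total);
    }
    return result;
  }
}
```

Calling {{absoluteFrequencies}} with n set to 1, 2 and 3 on the same token list would give the unigram, bigram and trigram tables asked for in the description.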



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
