kinow commented on code in PR #445:
URL: https://github.com/apache/opennlp/pull/445#discussion_r1038965583
##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetector.java:
##########
@@ -20,14 +20,29 @@
import java.io.Serializable;
/**
- * The interface for LanguageDetector which provide the @{@link Language} according to the context.
+ * The interface for {@link LanguageDetector} which predicts the {@link Language} for a context.
*/
public interface LanguageDetector extends Serializable {
+ /**
+ * Predicts the {@link Language languages} for the full {@code content} length.
+ *
+ * @param content The textual content to detect potential {@link Language languages} from.
+ * @return the predicted languages
+ */
Language[] predictLanguages(CharSequence content);
+ /**
+ * Predicts the {@link Language} for the full {@code content} length.
+ *
+ * @param content The textual content to detect potential {@link Language languages} from.
+ * @return the language with the highest confidence
+ */
Language predictLanguage(CharSequence content);
+ /**
+ * @return Retrieves an array of language (codes) that are supported by a {@link LanguageDetector}.
Review Comment:
Some @return descriptions start with an upper-case letter, others don't. It
doesn't really bother me, but I'm flagging it in case there is a convention or
an intention to standardize it 👍 (no need to change anything if you don't
want to, really).
##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -114,16 +135,30 @@ public StringList predictNextTokens(StringList tokens) {
return token;
}
+ private double calculateProbability(StringList tokens) {
+ double probability = 0d;
+ if (size() > 0) {
+ for (StringList ngram : NGramUtils.getNGrams(tokens, n)) {
+ double score = stupidBackoff(ngram);
+ probability += StrictMath.log(score);
+ if (Double.isNaN(probability)) {
+ probability = 0d;
+ break;
+ }
+ }
+ probability = StrictMath.exp(probability);
+ }
+ return probability;
+ }
+
@Override
public String[] predictNextTokens(String... tokens) {
double maxProb = Double.NEGATIVE_INFINITY;
String[] token = null;
for (StringList ngram : this) {
String[] sequence = new String[ngram.size() + tokens.length];
- for (int i = 0; i < tokens.length; i++) {
- sequence[i] = tokens[i];
- }
+ System.arraycopy(tokens, 0, sequence, 0, tokens.length);
Review Comment:
👏
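For anyone following along, here is a minimal sketch of why this swap is behavior-preserving: both forms copy the same range into the front of a larger array and leave the tail untouched. The class and method names (`ArrayCopySketch`, `copyWithLoop`, `copyWithArraycopy`) are hypothetical and only for illustration; they are not part of OpenNLP.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
public class ArrayCopySketch {

  // The removed form: copy the tokens element by element.
  static String[] copyWithLoop(String[] tokens, int extra) {
    String[] sequence = new String[tokens.length + extra];
    for (int i = 0; i < tokens.length; i++) {
      sequence[i] = tokens[i];
    }
    return sequence;
  }

  // The added form: one bulk copy of the same range.
  static String[] copyWithArraycopy(String[] tokens, int extra) {
    String[] sequence = new String[tokens.length + extra];
    System.arraycopy(tokens, 0, sequence, 0, tokens.length);
    return sequence;
  }

  public static void main(String[] args) {
    String[] tokens = {"the", "quick", "brown"};
    // Both results hold the three tokens followed by two null slots.
    String[] viaLoop = copyWithLoop(tokens, 2);
    String[] viaArraycopy = copyWithArraycopy(tokens, 2);
    System.out.println(Arrays.equals(viaLoop, viaArraycopy)); // prints "true"
  }
}
```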
##########
opennlp-tools/src/main/java/opennlp/tools/languagemodel/NGramLanguageModel.java:
##########
@@ -91,6 +111,7 @@ public double calculateProbability(String... tokens) {
}
@Override
+ @Deprecated
Review Comment:
Is there a recommendation for callers of this deprecated method? What should
they use instead?
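One common way to answer this in the code itself is a Javadoc `@deprecated` tag that names the replacement next to the `@Deprecated` annotation. A minimal sketch follows; all names in it (`DeprecationSketch`, `Model`, `EchoModel`, `predictNext`) are hypothetical, and it assumes a varargs overload is the intended replacement, which this PR does not confirm.

```java
import java.util.List;

// Hypothetical types for illustration only; not the actual OpenNLP API.
public class DeprecationSketch {

  interface Model {
    /**
     * @deprecated Use {@link #predictNext(String...)} instead.
     */
    @Deprecated
    String[] predictNext(List<String> tokens);

    String[] predictNext(String... tokens);
  }

  static class EchoModel implements Model {
    @Override
    @Deprecated
    public String[] predictNext(List<String> tokens) {
      // Delegate to the replacement so both entry points behave identically.
      return predictNext(tokens.toArray(new String[0]));
    }

    @Override
    public String[] predictNext(String... tokens) {
      // Placeholder behavior: return a copy of the input tokens.
      return tokens.clone();
    }
  }

  public static void main(String[] args) {
    Model model = new EchoModel();
    System.out.println(model.predictNext("a", "b").length); // prints "2"
  }
}
```

With the tag in place, IDEs and the generated Javadoc point callers at the replacement directly.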
##########
opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java:
##########
@@ -34,11 +34,12 @@ public class DefaultLanguageDetectorContextGenerator
implements LanguageDetector
protected final CharSequenceNormalizer normalizer;
/**
- * Creates a customizable @{@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text
- * @param minLength min ngrams chars
- * @param maxLength max ngrams chars
- * @param normalizers zero or more normalizers to
- * be applied in to the text before extracting ngrams
+ * Creates a customizable {@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text.
+ *
+ * @param minLength The min number of ngrams characters. Must be greater than {@code 0}.
+ * @param maxLength The max number of ngrams characters. Must be greater than {@code 0}
+ * and must be greater than {@code minLength}.
+ * @param normalizers zero or more normalizers to be applied in to the text before extracting ngrams.
Review Comment:
Upper-case Z ("Zero or more normalizers ..."), to match the other @param
descriptions?