mawiesne commented on code in PR #1057:
URL: https://github.com/apache/opennlp/pull/1057#discussion_r3362534165


##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/distance/DamerauOSADistance.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.distance;
+
+/**
+ * Optimal String Alignment (restricted Damerau-Levenshtein) edit distance.
+ *
+ * <p>Counts insertions, deletions and substitutions, plus transpositions of two
+ * adjacent symbols, each with a unit cost. As an optimal-string-alignment metric, a
+ * given substring may not be edited more than once, which is the variant used by the
+ * SymSpell reference implementation.</p>
+ *
+ * <p>This is the {@linkplain #INSTANCE default} edit distance for the engine. It is
+ * Unicode-aware: comparison happens on Unicode code points, so characters outside the
+ * Basic Multilingual Plane (e.g. many emoji) are treated as single symbols.</p>
+ *
+ * <p>Instances are immutable and thread-safe. A bounded computation with early exit is
+ * provided through {@link #distance(CharSequence, CharSequence, int)}.</p>
+ */
+public final class DamerauOSADistance implements EditDistance {
+
+  /** Shared, stateless instance. */
+  public static final DamerauOSADistance INSTANCE = new DamerauOSADistance();
+
+  public DamerauOSADistance() {
+  }
+
+  @Override
+  public int distance(CharSequence a, CharSequence b, int max) {
+    if (a == null || b == null) {
+      throw new NullPointerException("input sequences must not be null");
+    }
+    if (max < 0) {

Review Comment:
   See the comment on `LevenshteinDistance` below regarding meaningful values of the `max` parameter.



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/stream/SpellCorrectingTokenStream.java:
##########
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.stream;
+
+import java.io.IOException;
+import java.util.Objects;
+import java.util.regex.Pattern;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.dictionary.SymSpellModel;
+import opennlp.spellcheck.normalizer.SpellCheckingCharSequenceNormalizer;
+import opennlp.tools.util.FilterObjectStream;
+import opennlp.tools.util.ObjectStream;
+
+/**
+ * A {@link FilterObjectStream} for <em>tokenized</em> data: each element read from the
+ * wrapped {@link ObjectStream} is a string of tokens separated by a known delimiter
+ * (whitespace by default). Every token is spell-corrected independently and the tokens
+ * are re-joined with the same delimiter.
+ *
+ * <p>This is the shape produced by OpenNLP tokenizers / token-sample formats and is
+ * what the trainable components consume: a fixed sequence of tokens per element. Unlike
+ * {@link SpellCorrectingObjectStream} in compound mode, this stream is
+ * <em>token-count preserving</em> &ndash; it never splits or merges tokens, so the
+ * corrected element stays aligned with any parallel annotation (tags, spans).</p>
+ *
+ * <p>Correction always runs in
+ * {@link SpellCheckingCharSequenceNormalizer.Mode#PER_TOKEN per-token} mode and reuses
+ * the normalizer's guards (minimum length, skip numbers/URLs, never change a word the
+ * dictionary already contains) and its casing preservation.</p>
+ *
+ * <p>{@code null} (end of stream) is forwarded unchanged; {@link #reset()} and
+ * {@link #close()} delegate to the wrapped stream.</p>
+ */
+public class SpellCorrectingTokenStream extends FilterObjectStream<String, String> {
+
+  /** The default delimiter splitting and re-joining tokens (a single space). */
+  public static final String DEFAULT_DELIMITER = " ";
+
+  private final SpellCheckingCharSequenceNormalizer normalizer;
+  private final String delimiter;
+  private final Pattern splitPattern;
+
+  /**
+   * Wraps {@code samples} with a default corrector ({@link #DEFAULT_DELIMITER space}
+   * delimited) backed by a {@link SpellChecker}.
+   *
+   * @param samples      the source token-line stream; must not be {@code null}
+   * @param spellChecker the engine used to correct tokens; must not be {@code null}
+   */
+  public SpellCorrectingTokenStream(ObjectStream<String> samples, SpellChecker spellChecker) {
+    this(samples,
+        SpellCheckingCharSequenceNormalizer.builder(
+            Objects.requireNonNull(spellChecker, "spellChecker must not be null"))
+            .mode(SpellCheckingCharSequenceNormalizer.Mode.PER_TOKEN).build(),
+        DEFAULT_DELIMITER);
+  }
+
+  /**
+   * Wraps {@code samples} with a default corrector ({@link #DEFAULT_DELIMITER space}
+   * delimited) backed by a loaded {@link SymSpellModel}.
+   *
+   * @param samples the source token-line stream; must not be {@code null}
+   * @param model   the loaded model whose engine is used; must not be {@code null}
+   */
+  public SpellCorrectingTokenStream(ObjectStream<String> samples, SymSpellModel model) {
+    this(samples, Objects.requireNonNull(model, "model must not be null").getSymSpell());
+  }
+
+  /**
+   * Wraps {@code samples} with an explicitly configured corrector and delimiter.
+   *
+   * <p>The corrector is forced into per-token mode regardless of how it was built, so
+   * the token count is always preserved.</p>
+   *
+   * @param samples    the source token-line stream; must not be {@code null}
+   * @param normalizer the corrector whose guards/config are reused; must not be
+   *                   {@code null}
+   * @param delimiter  the literal token delimiter to split and re-join on; must not be
+   *                   {@code null} or empty
+   */
+  public SpellCorrectingTokenStream(ObjectStream<String> samples,
+                                    SpellCheckingCharSequenceNormalizer normalizer,
+                                    String delimiter) {
+    super(samples);
+    this.normalizer = Objects.requireNonNull(normalizer, "normalizer must not be null");
+    Objects.requireNonNull(delimiter, "delimiter must not be null");
+    if (delimiter.isEmpty()) {
+      throw new IllegalArgumentException("delimiter must not be empty");

Review Comment:
   Add Javadoc `@throws` declaration for this case.
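
   A minimal sketch of what the requested declaration could look like. `DelimiterValidationSketch` is a hypothetical stand-in, not the PR's class, and the `Pattern.quote` based split is an assumption about how a literal delimiter might be handled.

```java
import java.util.Objects;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the delimiter handling; not the PR's actual class. */
final class DelimiterValidationSketch {

  private final String delimiter;
  private final Pattern splitPattern;

  /**
   * @param delimiter the literal token delimiter; must not be {@code null} or empty
   * @throws NullPointerException if {@code delimiter} is {@code null}
   * @throws IllegalArgumentException if {@code delimiter} is empty
   */
  DelimiterValidationSketch(String delimiter) {
    Objects.requireNonNull(delimiter, "delimiter must not be null");
    if (delimiter.isEmpty()) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    this.delimiter = delimiter;
    // Assumption: quote the delimiter so regex metacharacters such as "|" split literally.
    this.splitPattern = Pattern.compile(Pattern.quote(delimiter));
  }

  /** Splits a line into tokens; the -1 limit keeps trailing empty tokens. */
  String[] split(String line) {
    return splitPattern.split(line, -1);
  }

  /** Re-joins tokens with the same delimiter, so the token count is unchanged. */
  String join(String[] tokens) {
    return String.join(delimiter, tokens);
  }
}
```

   Documenting both exception types separately keeps the distinction between a missing argument and an unusable one visible to callers.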



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/SpellChecker.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck;
+
+import java.util.List;
+
+/**
+ * A spelling corrector that proposes {@link SuggestItem suggestions} for individual
+ * terms and corrects whole sentences.
+ *
+ * <p>Implementations are expected to be thread-safe for concurrent {@code lookup} calls
+ * once their dictionary has been fully populated <i>and safely published</i> to the reading
+ * threads (e.g. stored in a {@code final} field, or otherwise guarded so a happens-before
+ * edge exists between population and the first read). Population itself is not required to
+ * be thread-safe.</p>
+ */
+public interface SpellChecker {
+
+  /**
+   * Looks up suggestions for a single {@code term} within {@code maxEditDistance}.
+   *
+   * @param term           the (possibly misspelled) term to correct; must not be {@code null}
+   * @param verbosity      controls how many suggestions are returned
+   * @param maxEditDistance the maximum edit distance to consider; must not be negative and
+   *                       must not exceed {@link #maxEditDistance()}
+   * @return the matching suggestions in natural order (best first); never {@code null}
+   */
+  List<SuggestItem> lookup(String term, Verbosity verbosity, int maxEditDistance);

Review Comment:
   What is thrown if the parameter expectations are not met? `IllegalArgumentException`, I'd guess?
   Please add the appropriate `@throws` Javadoc for this.
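
   One possible shape, purely as an assumption since the PR does not show what `lookup` currently throws: a shared argument check whose Javadoc carries the `@throws` text. `LookupArgumentsSketch`, `check` and `engineMax` are hypothetical names.

```java
/** Hypothetical argument check for a lookup call; not taken from the PR. */
final class LookupArgumentsSketch {

  private LookupArgumentsSketch() {
  }

  /**
   * @param term            the term to correct; must not be {@code null}
   * @param maxEditDistance the requested bound; must lie between 0 and {@code engineMax}
   * @param engineMax       the largest distance the engine supports
   * @throws IllegalArgumentException if {@code term} is {@code null}, or if
   *     {@code maxEditDistance} is negative or exceeds {@code engineMax}
   */
  static void check(String term, int maxEditDistance, int engineMax) {
    if (term == null) {
      throw new IllegalArgumentException("term must not be null");
    }
    if (maxEditDistance < 0 || maxEditDistance > engineMax) {
      throw new IllegalArgumentException(
          "maxEditDistance must be in [0, " + engineMax + "]: " + maxEditDistance);
    }
  }
}
```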



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpellConfig.java:
##########
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.Objects;
+
+import opennlp.spellcheck.distance.DamerauOSADistance;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Immutable configuration for {@link SymSpell}, created through {@link #builder()}.
+ *
+ * <p>The tunables mirror the SymSpell reference implementation:</p>
+ * <ul>
+ *   <li><b>maxDictionaryEditDistance</b> &ndash; the largest edit distance for which the
+ *       deletes index is precomputed; queries cannot exceed this.</li>
+ *   <li><b>prefixLength</b> &ndash; only the first {@code prefixLength} symbols of each
+ *       term are used to generate deletes, trading index size for recall on long words.</li>
+ *   <li><b>countThreshold</b> &ndash; the minimum corpus count for a term to be indexed.</li>
+ *   <li><b>editDistance</b> &ndash; the verification metric, defaulting to
+ *       {@link DamerauOSADistance}.</li>
+ *   <li><b>corpusWordCount</b> &ndash; the corpus normalization constant <i>N</i> used by
+ *       the Naive-Bayes word combine/split scoring in
+ *       {@link SymSpell#lookupCompound(String, int)}. Defaults to
+ *       {@link #DERIVE_CORPUS_WORD_COUNT}, which makes the engine derive <i>N</i> from the
+ *       summed counts of the loaded dictionary so it is always corpus-correct; set it
+ *       explicitly to pin <i>N</i> (e.g. to the full-corpus total a reference dictionary was
+ *       drawn from).</li>
+ * </ul>
+ */
+public final class SymSpellConfig {
+
+  /**
+   * Sentinel for {@link #corpusWordCount()} meaning "derive <i>N</i> from the summed counts
+   * of the loaded dictionary" rather than pinning it to a fixed value.
+   */
+  public static final long DERIVE_CORPUS_WORD_COUNT = 0L;
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+  private final long corpusWordCount;
+
+  private SymSpellConfig(Builder b) {
+    this.maxDictionaryEditDistance = b.maxDictionaryEditDistance;
+    this.prefixLength = b.prefixLength;
+    this.countThreshold = b.countThreshold;
+    this.editDistance = b.editDistance;
+    this.corpusWordCount = b.corpusWordCount;
+  }
+
+  public int maxDictionaryEditDistance() {
+    return maxDictionaryEditDistance;
+  }
+
+  public int prefixLength() {
+    return prefixLength;
+  }
+
+  public long countThreshold() {
+    return countThreshold;
+  }
+
+  public EditDistance editDistance() {
+    return editDistance;
+  }
+
+  /**
+   * @return the pinned corpus normalization constant <i>N</i>, or
+   *     {@link #DERIVE_CORPUS_WORD_COUNT} when <i>N</i> is derived from the loaded
+   *     dictionary's summed counts.
+   */
+  public long corpusWordCount() {
+    return corpusWordCount;
+  }
+
+  /**
+   * @return a builder with the SymSpell reference defaults
+   *     (maxDictionaryEditDistance=2, prefixLength=7, countThreshold=1,
+   *     editDistance={@link DamerauOSADistance}).
+   */
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /** @return a configuration with all reference defaults. */
+  public static SymSpellConfig defaultConfig() {
+    return builder().build();
+  }
+
+  /** Mutable builder for {@link SymSpellConfig}. */
+  public static final class Builder {
+
+    private int maxDictionaryEditDistance = 2;
+    private int prefixLength = 7;
+    private long countThreshold = 1;
+    private EditDistance editDistance = DamerauOSADistance.INSTANCE;
+    private long corpusWordCount = DERIVE_CORPUS_WORD_COUNT;
+
+    private Builder() {
+    }
+
+    /**
+     * @param value largest precomputed dictionary edit distance; must be {@code >= 0}
+     * @return this builder
+     */
+    public Builder maxDictionaryEditDistance(int value) {
+      if (value < 0) {
+        throw new IllegalArgumentException("maxDictionaryEditDistance must be >= 0: " + value);
+      }
+      this.maxDictionaryEditDistance = value;
+      return this;
+    }
+
+    /**
+     * @param value number of leading symbols used for delete generation; must be
+     *     {@code >= 1} and {@code > maxDictionaryEditDistance}
+     * @return this builder
+     */
+    public Builder prefixLength(int value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("prefixLength must be >= 1: " + value);
+      }
+      this.prefixLength = value;
+      return this;
+    }
+
+    /**
+     * @param value minimum corpus count for a term to be indexed; must be {@code >= 1}
+     * @return this builder
+     */
+    public Builder countThreshold(long value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("countThreshold must be >= 1: " + value);
+      }
+      this.countThreshold = value;
+      return this;
+    }
+
+    /**
+     * @param value verification metric to inject; must not be {@code null}
+     * @return this builder
+     */
+    public Builder editDistance(EditDistance value) {
+      this.editDistance = Objects.requireNonNull(value, "editDistance must not be null");
+      return this;
+    }
+
+    /**
+     * Pins the corpus normalization constant <i>N</i> used by the Naive-Bayes word
+     * combine/split scoring in {@link SymSpell#lookupCompound(String, int)}.
+     *
+     * @param value the corpus word count to pin, or {@link #DERIVE_CORPUS_WORD_COUNT} to
+     *              derive <i>N</i> from the loaded dictionary's summed counts; must be
+     *              {@code >= 0}
+     * @return this builder
+     */
+    public Builder corpusWordCount(long value) {
+      if (value < 0) {
+        throw new IllegalArgumentException("corpusWordCount must be >= 0: " + value);
+      }
+      this.corpusWordCount = value;
+      return this;
+    }
+
+    /** @return the immutable configuration. */
+    public SymSpellConfig build() {
+      if (prefixLength <= maxDictionaryEditDistance) {
+        throw new IllegalArgumentException(

Review Comment:
   Add Javadoc detailing `@throws` for `IllegalArgumentException`.
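
   A reduced sketch of the cross-field check and its Javadoc. `PrefixConfigBuilderSketch` is a hypothetical two-field builder; the exception message is invented here because the hunk above cuts off before the PR's own message.

```java
/** Hypothetical two-field builder; it only illustrates the documented cross-field check. */
final class PrefixConfigBuilderSketch {

  private int maxDictionaryEditDistance = 2;
  private int prefixLength = 7;

  PrefixConfigBuilderSketch maxDictionaryEditDistance(int value) {
    if (value < 0) {
      throw new IllegalArgumentException("maxDictionaryEditDistance must be >= 0: " + value);
    }
    this.maxDictionaryEditDistance = value;
    return this;
  }

  PrefixConfigBuilderSketch prefixLength(int value) {
    if (value < 1) {
      throw new IllegalArgumentException("prefixLength must be >= 1: " + value);
    }
    this.prefixLength = value;
    return this;
  }

  /**
   * Validates the combination of both fields. No single setter can do this,
   * because the setters may be called in any order.
   *
   * @return a two-element array holding maxDictionaryEditDistance and prefixLength
   * @throws IllegalArgumentException if {@code prefixLength <= maxDictionaryEditDistance}
   */
  int[] build() {
    if (prefixLength <= maxDictionaryEditDistance) {
      // Hypothetical message; the PR's actual text is not visible in the hunk.
      throw new IllegalArgumentException("prefixLength (" + prefixLength
          + ") must be > maxDictionaryEditDistance (" + maxDictionaryEditDistance + ")");
    }
    return new int[] {maxDictionaryEditDistance, prefixLength};
  }
}
```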



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/distance/LevenshteinDistance.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.distance;
+
+/**
+ * Plain Levenshtein edit distance (insertions, deletions, substitutions; no
+ * transpositions). This is a thin adapter over
+ * {@link org.apache.commons.text.similarity.LevenshteinDistance} and is offered as a
+ * selectable alternative to the default {@link DamerauOSADistance}.
+ *
+ * <p>It honors the bounded {@link EditDistance} contract by delegating to a
+ * threshold-aware Commons Text instance.</p>
+ *
+ * <p>Note that Commons Text computes distances over UTF-16 {@code char} units, so
+ * supplementary characters count as two symbols here. For full code-point correctness
+ * prefer {@link DamerauOSADistance}.</p>
+ */
+public final class LevenshteinDistance implements EditDistance {
+
+  /** Shared, stateless instance. */
+  public static final LevenshteinDistance INSTANCE = new LevenshteinDistance();
+
+  public LevenshteinDistance() {
+  }
+
+  @Override
+  public int distance(CharSequence a, CharSequence b, int max) {
+    if (a == null || b == null) {
+      throw new NullPointerException("input sequences must not be null");
+    }
+    if (max < 0) {
+      throw new IllegalArgumentException("max must not be negative: " + max);
+    }
+    // The threshold-aware Commons Text instance returns -1 when the distance exceeds
+    // the supplied threshold, which matches our contract exactly.
+    return new org.apache.commons.text.similarity.LevenshteinDistance(max).apply(a, b);

Review Comment:
   Do we need to rely on this dependency, or could we just write the calculation code ourselves in OpenNLP?
   I'd prefer to skip commons-text here, if possible.
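
   To give an idea of the effort involved, here is a minimal sketch of a dependency-free bounded Levenshtein distance. `BoundedLevenshteinSketch` is a hypothetical name. It keeps the contract quoted in the code comment above (-1 when the distance exceeds `max`) and the PR's current argument checks. Comparing by code points rather than `char` units is an assumption that goes beyond what the Commons Text adapter does.

```java
/** Hypothetical in-house replacement for the Commons Text delegate; a sketch only. */
final class BoundedLevenshteinSketch {

  private BoundedLevenshteinSketch() {
  }

  /** Returns the Levenshtein distance of a and b, or -1 if it exceeds max. */
  static int distance(CharSequence a, CharSequence b, int max) {
    if (a == null || b == null) {
      throw new NullPointerException("input sequences must not be null");
    }
    if (max < 0) {
      throw new IllegalArgumentException("max must not be negative: " + max);
    }
    // Assumption: compare Unicode code points, so supplementary characters count once.
    int[] s = a.codePoints().toArray();
    int[] t = b.codePoints().toArray();
    // The length difference is a lower bound on the distance.
    if (Math.abs(s.length - t.length) > max) {
      return -1;
    }
    int[] prev = new int[t.length + 1];
    int[] curr = new int[t.length + 1];
    for (int j = 0; j <= t.length; j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= s.length; i++) {
      curr[0] = i;
      int rowMin = curr[0];
      for (int j = 1; j <= t.length; j++) {
        int cost = s[i - 1] == t[j - 1] ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        rowMin = Math.min(rowMin, curr[j]);
      }
      // Row minima never decrease, so the final distance cannot drop back under max.
      if (rowMin > max) {
        return -1;
      }
      int[] swap = prev;
      prev = curr;
      curr = swap;
    }
    return prev[t.length] <= max ? prev[t.length] : -1;
  }
}
```

   For example, `distance("kitten", "sitting", 3)` returns 3, while the same call with `max = 2` returns -1. The two-row table keeps memory linear in the length of `b`; a banded variant would be faster for small `max` but is omitted here for brevity.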



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/distance/LevenshteinDistance.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.distance;
+
+/**
+ * Plain Levenshtein edit distance (insertions, deletions, substitutions; no
+ * transpositions). This is a thin adapter over
+ * {@link org.apache.commons.text.similarity.LevenshteinDistance} and is offered as a
+ * selectable alternative to the default {@link DamerauOSADistance}.
+ *
+ * <p>It honors the bounded {@link EditDistance} contract by delegating to a
+ * threshold-aware Commons Text instance.</p>
+ *
+ * <p>Note that Commons Text computes distances over UTF-16 {@code char} units, so
+ * supplementary characters count as two symbols here. For full code-point correctness
+ * prefer {@link DamerauOSADistance}.</p>
+ */
+public final class LevenshteinDistance implements EditDistance {
+
+  /** Shared, stateless instance. */
+  public static final LevenshteinDistance INSTANCE = new LevenshteinDistance();
+
+  public LevenshteinDistance() {
+  }
+
+  @Override
+  public int distance(CharSequence a, CharSequence b, int max) {
+    if (a == null || b == null) {
+      throw new NullPointerException("input sequences must not be null");

Review Comment:
   This should throw `IllegalArgumentException` instead, for consistency with the check on `max` below and to express the problem appropriately.
   
   The Javadoc of the related interface `EditDistance` should mention those parameter checks and declare them via `@throws`.
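
   A sketch of how the interface-level contract might read if both checks used the same exception type. `EditDistanceContractSketch` and `checkArguments` are hypothetical; the real `EditDistance` interface is not shown in this review, so its other members and wording are unknown.

```java
/** Hypothetical stand-in for the EditDistance interface; not the PR's declaration. */
interface EditDistanceContractSketch {

  /**
   * Computes the edit distance between {@code a} and {@code b}, bounded by {@code max}.
   *
   * @param a   the first sequence; must not be {@code null}
   * @param b   the second sequence; must not be {@code null}
   * @param max the largest distance of interest; must not be negative
   * @return the distance, or -1 if it exceeds {@code max}
   * @throws IllegalArgumentException if {@code a} or {@code b} is {@code null},
   *     or if {@code max} is negative
   */
  int distance(CharSequence a, CharSequence b, int max);

  /** Shared argument check, so every implementation reports violations the same way. */
  static void checkArguments(CharSequence a, CharSequence b, int max) {
    if (a == null || b == null) {
      throw new IllegalArgumentException("input sequences must not be null");
    }
    if (max < 0) {
      throw new IllegalArgumentException("max must not be negative: " + max);
    }
  }
}
```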



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/distance/LevenshteinDistance.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.distance;
+
+/**
+ * Plain Levenshtein edit distance (insertions, deletions, substitutions; no
+ * transpositions). This is a thin adapter over
+ * {@link org.apache.commons.text.similarity.LevenshteinDistance} and is offered as a
+ * selectable alternative to the default {@link DamerauOSADistance}.
+ *
+ * <p>It honors the bounded {@link EditDistance} contract by delegating to a
+ * threshold-aware Commons Text instance.</p>
+ *
+ * <p>Note that Commons Text computes distances over UTF-16 {@code char} units, so
+ * supplementary characters count as two symbols here. For full code-point correctness
+ * prefer {@link DamerauOSADistance}.</p>
+ */
+public final class LevenshteinDistance implements EditDistance {
+
+  /** Shared, stateless instance. */
+  public static final LevenshteinDistance INSTANCE = new LevenshteinDistance();
+
+  public LevenshteinDistance() {
+  }
+
+  @Override
+  public int distance(CharSequence a, CharSequence b, int max) {
+    if (a == null || b == null) {
+      throw new NullPointerException("input sequences must not be null");
+    }
+    if (max < 0) {

Review Comment:
   Is `max = 0` a meaningful value? Would the computation give up "early" at an edit distance of zero?!
   Please clarify this on the `EditDistance` interface.
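
   For illustration, under the contract quoted in the code comment of this class (-1 when the distance exceeds the bound), a bound of zero would reduce to an equality test. The helper below is hypothetical and only spells out that reading; whether the PR intends it is exactly the open question.

```java
/** Hypothetical helper showing what a bound of zero means under the -1 contract. */
final class ZeroBoundSketch {

  private ZeroBoundSketch() {
  }

  /**
   * Bounded distance for the special case {@code max == 0}: the only distance that
   * does not exceed the bound is 0, so the call degenerates to a content comparison.
   *
   * @return 0 if both sequences have identical content, otherwise -1
   */
  static int distanceWithZeroBound(CharSequence a, CharSequence b) {
    return a.toString().contentEquals(b) ? 0 : -1;
  }
}
```

   If that reading is intended, the interface Javadoc could state that `max = 0` is allowed and answers "are the inputs identical?".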



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpell.java:
##########
@@ -0,0 +1,605 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.SuggestItem;
+import opennlp.spellcheck.Verbosity;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Symmetric Delete spelling correction engine (SymSpell).
+ *
+ * <p>The engine precomputes a deletes-only index: for every dictionary term it derives
+ * all strings obtained by deleting up to {@code maxDictionaryEditDistance} symbols from
+ * the term's prefix, and maps each such delete back to the originating terms. A query is
+ * answered by generating the deletes of the query and intersecting them with the index,
+ * which turns the costly fuzzy search into hash-map lookups; candidates are then verified
+ * with the injected {@link EditDistance}.</p>
+ *
+ * <p>The algorithm and its compound-correction heuristic are ported from the SymSpell
+ * reference implementation (MIT, Wolf Garbe). This is an independent Java 21 rewrite,
+ * not a verbatim copy; attribution is recorded in the project NOTICE file.</p>
+ *
+ * <p>Populate the engine through {@link #add(String, long)} and
+ * {@link #addBigram(String, String, long)} (typically driven by a separate loader), then
+ * issue {@link #lookup} / {@link #lookupCompound} queries. After population the engine is
+ * safe for concurrent reads.</p>
+ */
+public final class SymSpell implements SpellChecker {
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+
+  /** term -> corpus count. */
+  private final Map<String, Long> words = new HashMap<>();
+
+  /** delete-hash -> list of dictionary terms that produce it. */
+  private final Map<String, String[]> deletes = new HashMap<>();
+
+  /** "w1 w2" -> corpus count, for compound correction. */
+  private final Map<String, Long> bigrams = new HashMap<>();
+
+  private long maxBigramCount;
+
+  /** Smallest bigram count seen; caps the Naive-Bayes estimate so real bigrams rank first. */
+  private long minBigramCount = Long.MAX_VALUE;
+
+  /** Length (in code points) of the longest indexed term. */
+  private int maxLength;
+
+  /**
+   * Pinned corpus normalization constant <i>N</i>, or
+   * {@link SymSpellConfig#DERIVE_CORPUS_WORD_COUNT} to derive it from {@link #totalCorpusCount}.
+   */
+  private final long configuredCorpusWordCount;
+
+  /** Running sum of every count added (the derived corpus size <i>N</i>). */
+  private long totalCorpusCount;
+
+  public SymSpell(SymSpellConfig config) {
+    Objects.requireNonNull(config, "config must not be null");
+    this.maxDictionaryEditDistance = config.maxDictionaryEditDistance();
+    this.prefixLength = config.prefixLength();
+    this.countThreshold = config.countThreshold();
+    this.editDistance = config.editDistance();
+    this.configuredCorpusWordCount = config.corpusWordCount();
+  }
+
+  /** Creates an engine with the {@linkplain SymSpellConfig#defaultConfig() default} config. */
+  public SymSpell() {
+    this(SymSpellConfig.defaultConfig());
+  }
+
+  // ------------------------------------------------------------------
+  // Build hooks (fed by the persistence layer; no I/O happens here).
+  // ------------------------------------------------------------------
+
+  /**
+   * Adds (or accumulates) a dictionary term and its count, updating the deletes index.
+   *
+   * <p>If the term already exists, {@code count} is added to the existing count. Terms
+   * whose accumulated count stays below the configured {@code countThreshold} are tracked
+   * but not indexed until they reach the threshold.</p>
+   *
+   * @param word  the dictionary term; must not be {@code null}
+   * @param count the corpus count to add; must be {@code >= 0}
+   * @return {@code true} if the term became (or remained) indexed
+   */
+  public boolean add(String word, long count) {
+    Objects.requireNonNull(word, "word must not be null");
+    if (count < 0) {
+      throw new IllegalArgumentException("count must not be negative: " + count);

Review Comment:
   Add Javadoc detailing `@throws` for IllegalArgumentException.
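
   A reduced sketch of the documented behavior with both exceptions declared. `CountingDictionarySketch` is hypothetical: it keeps only the count accumulation and the threshold test described in the Javadoc above and leaves out the deletes index entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical reduction of the add logic; the deletes index is deliberately omitted. */
final class CountingDictionarySketch {

  private final long countThreshold;
  private final Map<String, Long> counts = new HashMap<>();

  CountingDictionarySketch(long countThreshold) {
    this.countThreshold = countThreshold;
  }

  /**
   * Adds (or accumulates) a term and its count.
   *
   * @param word  the dictionary term; must not be {@code null}
   * @param count the corpus count to add; must be {@code >= 0}
   * @return {@code true} if the accumulated count has reached the threshold
   * @throws NullPointerException if {@code word} is {@code null}
   * @throws IllegalArgumentException if {@code count} is negative
   */
  boolean add(String word, long count) {
    Objects.requireNonNull(word, "word must not be null");
    if (count < 0) {
      throw new IllegalArgumentException("count must not be negative: " + count);
    }
    // merge() stores the count for a new term or sums it onto the existing one.
    // Overflow of very large counts is not handled in this sketch.
    long total = counts.merge(word, count, Long::sum);
    return total >= countThreshold;
  }
}
```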



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpell.java:
##########
@@ -0,0 +1,605 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.SuggestItem;
+import opennlp.spellcheck.Verbosity;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Symmetric Delete spelling correction engine (SymSpell).
+ *
+ * <p>The engine precomputes a deletes-only index: for every dictionary term it derives
+ * all strings obtained by deleting up to {@code maxDictionaryEditDistance} symbols from
+ * the term's prefix, and maps each such delete back to the originating terms. A query is
+ * answered by generating the deletes of the query and intersecting them with the index,
+ * which turns the costly fuzzy search into hash-map lookups; candidates are then verified
+ * with the injected {@link EditDistance}.</p>
+ *
+ * <p>The algorithm and its compound-correction heuristic are ported from the SymSpell
+ * reference implementation (MIT, Wolf Garbe). This is an independent Java 21 rewrite,
+ * not a verbatim copy; attribution is recorded in the project NOTICE file.</p>
+ *
+ * <p>Populate the engine through {@link #add(String, long)} and
+ * {@link #addBigram(String, String, long)} (typically driven by a separate loader), then
+ * issue {@link #lookup} / {@link #lookupCompound} queries. After population the engine is
+ * safe for concurrent reads.</p>
+ */
+public final class SymSpell implements SpellChecker {
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+
+  /** term -> corpus count. */
+  private final Map<String, Long> words = new HashMap<>();
+
+  /** delete-hash -> list of dictionary terms that produce it. */
+  private final Map<String, String[]> deletes = new HashMap<>();
+
+  /** "w1 w2" -> corpus count, for compound correction. */
+  private final Map<String, Long> bigrams = new HashMap<>();
+
+  private long maxBigramCount;
+
+  /** Smallest bigram count seen; caps the Naive-Bayes estimate so real bigrams rank first. */
+  private long minBigramCount = Long.MAX_VALUE;
+
+  /** Length (in code points) of the longest indexed term. */
+  private int maxLength;
+
+  /**
+   * Pinned corpus normalization constant <i>N</i>, or
+   * {@link SymSpellConfig#DERIVE_CORPUS_WORD_COUNT} to derive it from {@link #totalCorpusCount}.
+   */
+  private final long configuredCorpusWordCount;
+
+  /** Running sum of every count added (the derived corpus size <i>N</i>). */
+  private long totalCorpusCount;
+
+  public SymSpell(SymSpellConfig config) {

Review Comment:
   Add Javadoc for constructor detailing config parameter.
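
   A minimal sketch of the kind of constructor Javadoc meant here. `ConfigSketch` and `EngineSketch` are hypothetical stand-ins, not classes from this PR, and the `@throws NullPointerException` line assumes the constructor rejects a `null` config via `Objects.requireNonNull`:

```java
import java.util.Objects;

/** Hypothetical stand-in for SymSpellConfig, reduced to a single tunable. */
final class ConfigSketch {
  private final int maxDictionaryEditDistance;

  ConfigSketch(int maxDictionaryEditDistance) {
    this.maxDictionaryEditDistance = maxDictionaryEditDistance;
  }

  int maxDictionaryEditDistance() {
    return maxDictionaryEditDistance;
  }
}

/** Hypothetical stand-in for the engine; only the documented constructor matters. */
final class EngineSketch {
  private final int maxDictionaryEditDistance;

  /**
   * Creates an engine from the given configuration.
   *
   * @param config the immutable configuration whose tunables are copied into
   *     this engine; must not be {@code null}
   * @throws NullPointerException if {@code config} is {@code null}
   */
  EngineSketch(ConfigSketch config) {
    Objects.requireNonNull(config, "config must not be null");
    this.maxDictionaryEditDistance = config.maxDictionaryEditDistance();
  }

  int maxDictionaryEditDistance() {
    return maxDictionaryEditDistance;
  }
}
```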



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/distance/DamerauOSADistance.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.distance;
+
+/**
+ * Optimal String Alignment (restricted Damerau-Levenshtein) edit distance.
+ *
+ * <p>Counts insertions, deletions and substitutions, plus transpositions of two
+ * adjacent symbols, each with a unit cost. As an optimal-string-alignment metric, a
+ * given substring may not be edited more than once, which is the variant used by the
+ * SymSpell reference implementation.</p>
+ *
+ * <p>This is the {@linkplain #INSTANCE default} edit distance for the engine. It is
+ * Unicode-aware: comparison happens on Unicode code points, so characters outside the
+ * Basic Multilingual Plane (e.g. many emoji) are treated as single symbols.</p>
+ *
+ * <p>Instances are immutable and thread-safe. A bounded computation with early exit is
+ * provided through {@link #distance(CharSequence, CharSequence, int)}.</p>
+ */
+public final class DamerauOSADistance implements EditDistance {
+
+  /** Shared, stateless instance. */
+  public static final DamerauOSADistance INSTANCE = new DamerauOSADistance();
+
+  public DamerauOSADistance() {
+  }
+
+  @Override
+  public int distance(CharSequence a, CharSequence b, int max) {
+    if (a == null || b == null) {
+      throw new NullPointerException("input sequences must not be null");

Review Comment:
   This should throw `IllegalArgumentException` instead, for consistency with the check on `max` down below and to express the problem appropriately.
   
   The Javadoc of the related interface `EditDistance` should mention those parameter checks and declare them via `@throws`.
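
   Roughly what this could look like, as a sketch only. `DistanceArgs` is a hypothetical helper used for illustration, and the exact condition and message for `max` are assumptions, since that check is not visible in this hunk:

```java
/** Hypothetical helper illustrating the suggested argument checks. */
final class DistanceArgs {

  private DistanceArgs() {
  }

  /**
   * Validates the arguments of a bounded edit-distance computation.
   *
   * @param a the first sequence; must not be {@code null}
   * @param b the second sequence; must not be {@code null}
   * @param max the largest distance of interest; assumed here to be {@code >= 0}
   * @throws IllegalArgumentException if {@code a} or {@code b} is {@code null},
   *     or if {@code max} is negative
   */
  static void check(CharSequence a, CharSequence b, int max) {
    if (a == null || b == null) {
      throw new IllegalArgumentException("input sequences must not be null");
    }
    // Assumption: the real check on max may use a different bound or message.
    if (max < 0) {
      throw new IllegalArgumentException("max must not be negative: " + max);
    }
  }
}
```

   The same `@throws` wording would then go on the `EditDistance` interface method, so callers see the contract without reading the implementation.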



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/stream/package-info.java:
##########
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Pipeline integration of the SpellChecker as OpenNLP
+ * {@link opennlp.tools.util.FilterObjectStream} filters.
+ *
+ * <p>{@link opennlp.spellcheck.stream.SpellCorrectingObjectStream} corrects whole text
+ * lines read from any {@link opennlp.tools.util.ObjectStream} (e.g. a
+ * {@link opennlp.tools.util.PlainTextByLineStream}), while
+ * {@link opennlp.spellcheck.stream.SpellCorrectingTokenStream} corrects tokenized data
+ * token-by-token without changing the token count.</p>
+ */
+package opennlp.spellcheck.stream;

Review Comment:
   I thought we dropped `package-info.java` in the past? Is this file obsolete then?



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/normalizer/SpellCheckingCharSequenceNormalizer.java:
##########
@@ -0,0 +1,398 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.normalizer;
+
+import java.util.List;
+import java.util.Locale;
+import java.util.Objects;
+import java.util.regex.Pattern;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.SuggestItem;
+import opennlp.spellcheck.Verbosity;
+import opennlp.spellcheck.dictionary.SymSpellModel;
+import opennlp.tools.util.normalizer.AggregateCharSequenceNormalizer;
+import opennlp.tools.util.normalizer.CharSequenceNormalizer;
+
+/**
+ * A {@link CharSequenceNormalizer} that corrects spelling in text using a
+ * {@link SpellChecker} (typically a SymSpell engine).
+ *
+ * <p>The normalizer works in one of two {@linkplain Mode modes}:</p>
+ * <ul>
+ *   <li>{@link Mode#PER_TOKEN PER_TOKEN} (default) &ndash; the input is split 
into
+ *       whitespace-delimited tokens and each token is corrected independently 
with
+ *       {@link SpellChecker#lookup}. The original whitespace runs between 
tokens are
+ *       preserved verbatim, so the shape of the line is kept. Tokens the 
dictionary
+ *       already contains (best suggestion at edit distance {@code 0}) are left
+ *       untouched.</li>
+ *   <li>{@link Mode#COMPOUND COMPOUND} &ndash; the whole input is passed to
+ *       {@link SpellChecker#lookupCompound}, which additionally repairs 
wrongly
+ *       inserted or omitted spaces (word splits and merges). This collapses 
runs of
+ *       whitespace to single spaces, as the compound corrector re-tokenizes 
the
+ *       input.</li>
+ * </ul>
+ *
+ * <p>Several guards keep the corrector from "fixing" tokens that should be 
left as
+ * they are (configurable through the {@link Builder}):</p>
+ * <ul>
+ *   <li>tokens shorter than {@code minTokenLength} are skipped;</li>
+ *   <li>numeric tokens are skipped ({@code skipNumbers}, on by default);</li>
+ *   <li>URL- and email-like tokens are skipped ({@code skipUrls}, on by 
default);</li>
+ *   <li>a token whose lower-cased form is already in the dictionary is never
+ *       changed (the engine returns it at edit distance {@code 0}).</li>
+ * </ul>
+ *
+ * <p><b>Casing.</b> Dictionaries are normally lower-cased, so lookups are 
performed on
+ * the lower-cased token, and the original casing pattern is re-applied to the
+ * correction: an all-upper token yields an all-upper correction, a 
leading-capital
+ * token yields a leading-capital correction, otherwise the suggestion's own 
casing is
+ * used. When no correction applies, the original token (including its casing 
and any
+ * surrounding punctuation) is emitted unchanged.</p>
+ *
+ * <p>This normalizer composes cleanly inside an
+ * {@link AggregateCharSequenceNormalizer}; place it after noise-removing 
normalizers
+ * (URL, emoji, shrink) so it sees clean tokens.</p>
+ *
+ * <p><b>Serialization.</b> {@link CharSequenceNormalizer} is {@link 
java.io.Serializable},
+ * but the backing {@link SpellChecker} usually is not; it is therefore held 
in a
+ * {@code transient} field and is {@code null} after Java deserialization. A 
deserialized
+ * instance is inert until a checker is re-attached: obtain a working copy 
with the same
+ * settings via {@link #withSpellChecker(SpellChecker)} (this matches how the 
engine is
+ * rebuilt from a model rather than Java-serialized). Calling {@link 
#normalize} on an
+ * instance with no checker throws {@link IllegalStateException}.</p>
+ */
+public class SpellCheckingCharSequenceNormalizer implements CharSequenceNormalizer {
+
+  private static final long serialVersionUID = 1L;

Review Comment:
   Do we really need it to be serializable?



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/package-info.java:
##########
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * The Symmetric Delete (SymSpell) spelling-correction engine.
+ *
+ * <p>{@link opennlp.spellcheck.symspell.SymSpell} implements
+ * {@link opennlp.spellcheck.SpellChecker} using a precomputed deletes-only index for
+ * single-term lookup and compound (split/merge) correction;
+ * {@link opennlp.spellcheck.symspell.SymSpellConfig} holds its immutable, builder-created
+ * configuration. The engine is an independent Java re-implementation of the algorithm by
+ * Wolf Garbe (attribution in the project NOTICE file).</p>
+ */
+package opennlp.spellcheck.symspell;

Review Comment:
   See previous comments on `package-info.java` files



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpellConfig.java:
##########
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.Objects;
+
+import opennlp.spellcheck.distance.DamerauOSADistance;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Immutable configuration for {@link SymSpell}, created through {@link 
#builder()}.
+ *
+ * <p>The tunables mirror the SymSpell reference implementation:</p>
+ * <ul>
+ *   <li><b>maxDictionaryEditDistance</b> &ndash; the largest edit distance 
for which the
+ *       deletes index is precomputed; queries cannot exceed this.</li>
+ *   <li><b>prefixLength</b> &ndash; only the first {@code prefixLength} 
symbols of each
+ *       term are used to generate deletes, trading index size for recall on 
long words.</li>
+ *   <li><b>countThreshold</b> &ndash; the minimum corpus count for a term to 
be indexed.</li>
+ *   <li><b>editDistance</b> &ndash; the verification metric, defaulting to
+ *       {@link DamerauOSADistance}.</li>
+ *   <li><b>corpusWordCount</b> &ndash; the corpus normalization constant 
<i>N</i> used by
+ *       the Naive-Bayes word combine/split scoring in
+ *       {@link SymSpell#lookupCompound(String, int)}. Defaults to
+ *       {@link #DERIVE_CORPUS_WORD_COUNT}, which makes the engine derive 
<i>N</i> from the
+ *       summed counts of the loaded dictionary so it is always 
corpus-correct; set it
+ *       explicitly to pin <i>N</i> (e.g. to the full-corpus total a reference 
dictionary was
+ *       drawn from).</li>
+ * </ul>
+ */
+public final class SymSpellConfig {
+
+  /**
+   * Sentinel for {@link #corpusWordCount()} meaning "derive <i>N</i> from the 
summed counts
+   * of the loaded dictionary" rather than pinning it to a fixed value.
+   */
+  public static final long DERIVE_CORPUS_WORD_COUNT = 0L;
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+  private final long corpusWordCount;
+
+  private SymSpellConfig(Builder b) {
+    this.maxDictionaryEditDistance = b.maxDictionaryEditDistance;
+    this.prefixLength = b.prefixLength;
+    this.countThreshold = b.countThreshold;
+    this.editDistance = b.editDistance;
+    this.corpusWordCount = b.corpusWordCount;
+  }
+
+  public int maxDictionaryEditDistance() {
+    return maxDictionaryEditDistance;
+  }
+
+  public int prefixLength() {
+    return prefixLength;
+  }
+
+  public long countThreshold() {
+    return countThreshold;
+  }
+
+  public EditDistance editDistance() {
+    return editDistance;
+  }
+
+  /**
+   * @return the pinned corpus normalization constant <i>N</i>, or
+   *     {@link #DERIVE_CORPUS_WORD_COUNT} when <i>N</i> is derived from the 
loaded
+   *     dictionary's summed counts.
+   */
+  public long corpusWordCount() {
+    return corpusWordCount;
+  }
+
+  /**
+   * @return a builder with the SymSpell reference defaults
+   *     (maxDictionaryEditDistance=2, prefixLength=7, countThreshold=1,
+   *     editDistance={@link DamerauOSADistance}).
+   */
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /** @return a configuration with all reference defaults. */
+  public static SymSpellConfig defaultConfig() {
+    return builder().build();
+  }
+
+  /** Mutable builder for {@link SymSpellConfig}. */
+  public static final class Builder {
+
+    private int maxDictionaryEditDistance = 2;
+    private int prefixLength = 7;
+    private long countThreshold = 1;
+    private EditDistance editDistance = DamerauOSADistance.INSTANCE;
+    private long corpusWordCount = DERIVE_CORPUS_WORD_COUNT;
+
+    private Builder() {
+    }
+
+    /**
+     * @param value largest precomputed dictionary edit distance; must be 
{@code >= 0}
+     * @return this builder
+     */
+    public Builder maxDictionaryEditDistance(int value) {
+      if (value < 0) {
+        throw new IllegalArgumentException("maxDictionaryEditDistance must be >= 0: " + value);
+      }
+      this.maxDictionaryEditDistance = value;
+      return this;
+    }
+
+    /**
+     * @param value number of leading symbols used for delete generation; must 
be
+     *     {@code >= 1} and {@code > maxDictionaryEditDistance}
+     * @return this builder
+     */
+    public Builder prefixLength(int value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("prefixLength must be >= 1: " + value);
+      }
+      this.prefixLength = value;
+      return this;
+    }
+
+    /**
+     * @param value minimum corpus count for a term to be indexed; must be 
{@code >= 1}
+     * @return this builder
+     */
+    public Builder countThreshold(long value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("countThreshold must be >= 1: " + value);

Review Comment:
   Add Javadoc detailing `@throws` for `IllegalArgumentException`.
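
   For illustration, a sketch of the setter with the extra tag. `BuilderSketch` is a hypothetical stand-in for the nested `Builder`; only the `@throws` line is new relative to the Javadoc already in the diff:

```java
/** Hypothetical stand-in for SymSpellConfig.Builder with a single setter. */
final class BuilderSketch {

  private long countThreshold = 1;

  /**
   * @param value minimum corpus count for a term to be indexed; must be {@code >= 1}
   * @return this builder
   * @throws IllegalArgumentException if {@code value} is less than {@code 1}
   */
  BuilderSketch countThreshold(long value) {
    if (value < 1) {
      throw new IllegalArgumentException("countThreshold must be >= 1: " + value);
    }
    this.countThreshold = value;
    return this;
  }

  long countThreshold() {
    return countThreshold;
  }
}
```

   The other validating setters (`maxDictionaryEditDistance`, `prefixLength`) would get the same treatment.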



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpellConfig.java:
##########
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.Objects;
+
+import opennlp.spellcheck.distance.DamerauOSADistance;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Immutable configuration for {@link SymSpell}, created through {@link 
#builder()}.
+ *
+ * <p>The tunables mirror the SymSpell reference implementation:</p>
+ * <ul>
+ *   <li><b>maxDictionaryEditDistance</b> &ndash; the largest edit distance 
for which the
+ *       deletes index is precomputed; queries cannot exceed this.</li>
+ *   <li><b>prefixLength</b> &ndash; only the first {@code prefixLength} 
symbols of each
+ *       term are used to generate deletes, trading index size for recall on 
long words.</li>
+ *   <li><b>countThreshold</b> &ndash; the minimum corpus count for a term to 
be indexed.</li>
+ *   <li><b>editDistance</b> &ndash; the verification metric, defaulting to
+ *       {@link DamerauOSADistance}.</li>
+ *   <li><b>corpusWordCount</b> &ndash; the corpus normalization constant 
<i>N</i> used by
+ *       the Naive-Bayes word combine/split scoring in
+ *       {@link SymSpell#lookupCompound(String, int)}. Defaults to
+ *       {@link #DERIVE_CORPUS_WORD_COUNT}, which makes the engine derive 
<i>N</i> from the
+ *       summed counts of the loaded dictionary so it is always 
corpus-correct; set it
+ *       explicitly to pin <i>N</i> (e.g. to the full-corpus total a reference 
dictionary was
+ *       drawn from).</li>
+ * </ul>
+ */
+public final class SymSpellConfig {
+
+  /**
+   * Sentinel for {@link #corpusWordCount()} meaning "derive <i>N</i> from the 
summed counts
+   * of the loaded dictionary" rather than pinning it to a fixed value.
+   */
+  public static final long DERIVE_CORPUS_WORD_COUNT = 0L;
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+  private final long corpusWordCount;
+
+  private SymSpellConfig(Builder b) {
+    this.maxDictionaryEditDistance = b.maxDictionaryEditDistance;
+    this.prefixLength = b.prefixLength;
+    this.countThreshold = b.countThreshold;
+    this.editDistance = b.editDistance;
+    this.corpusWordCount = b.corpusWordCount;
+  }
+
+  public int maxDictionaryEditDistance() {
+    return maxDictionaryEditDistance;
+  }
+
+  public int prefixLength() {
+    return prefixLength;
+  }
+
+  public long countThreshold() {
+    return countThreshold;
+  }
+
+  public EditDistance editDistance() {
+    return editDistance;
+  }
+
+  /**
+   * @return the pinned corpus normalization constant <i>N</i>, or
+   *     {@link #DERIVE_CORPUS_WORD_COUNT} when <i>N</i> is derived from the 
loaded
+   *     dictionary's summed counts.
+   */
+  public long corpusWordCount() {
+    return corpusWordCount;
+  }
+
+  /**
+   * @return a builder with the SymSpell reference defaults

Review Comment:
   Can we add a `@see <a href="..">...</a>` here that links the external 
reference for this properly?



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpell.java:
##########
@@ -0,0 +1,605 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+
+import opennlp.spellcheck.SpellChecker;
+import opennlp.spellcheck.SuggestItem;
+import opennlp.spellcheck.Verbosity;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Symmetric Delete spelling correction engine (SymSpell).
+ *
+ * <p>The engine precomputes a deletes-only index: for every dictionary term it derives
+ * all strings obtained by deleting up to {@code maxDictionaryEditDistance} symbols from
+ * the term's prefix, and maps each such delete back to the originating terms. A query is
+ * answered by generating the deletes of the query and intersecting them with the index,
+ * which turns the costly fuzzy search into hash-map lookups; candidates are then verified
+ * with the injected {@link EditDistance}.</p>
+ *
+ * <p>The algorithm and its compound-correction heuristic are ported from the SymSpell
+ * reference implementation (MIT, Wolf Garbe). This is an independent Java 21 rewrite,
+ * not a verbatim copy; attribution is recorded in the project NOTICE file.</p>
+ *
+ * <p>Populate the engine through {@link #add(String, long)} and
+ * {@link #addBigram(String, String, long)} (typically driven by a separate loader), then
+ * issue {@link #lookup} / {@link #lookupCompound} queries. After population the engine is
+ * safe for concurrent reads.</p>
+ */
+public final class SymSpell implements SpellChecker {
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+
+  /** term -> corpus count. */
+  private final Map<String, Long> words = new HashMap<>();
+
+  /** delete-hash -> list of dictionary terms that produce it. */
+  private final Map<String, String[]> deletes = new HashMap<>();
+
+  /** "w1 w2" -> corpus count, for compound correction. */
+  private final Map<String, Long> bigrams = new HashMap<>();
+
+  private long maxBigramCount;
+
+  /** Smallest bigram count seen; caps the Naive-Bayes estimate so real bigrams rank first. */
+  private long minBigramCount = Long.MAX_VALUE;
+
+  /** Length (in code points) of the longest indexed term. */
+  private int maxLength;
+
+  /**
+   * Pinned corpus normalization constant <i>N</i>, or
+   * {@link SymSpellConfig#DERIVE_CORPUS_WORD_COUNT} to derive it from {@link #totalCorpusCount}.
+   */
+  private final long configuredCorpusWordCount;
+
+  /** Running sum of every count added (the derived corpus size <i>N</i>). */
+  private long totalCorpusCount;
+
+  public SymSpell(SymSpellConfig config) {
+    Objects.requireNonNull(config, "config must not be null");
+    this.maxDictionaryEditDistance = config.maxDictionaryEditDistance();
+    this.prefixLength = config.prefixLength();
+    this.countThreshold = config.countThreshold();
+    this.editDistance = config.editDistance();
+    this.configuredCorpusWordCount = config.corpusWordCount();
+  }
+
+  /** Creates an engine with the {@linkplain SymSpellConfig#defaultConfig() default} config. */
+  public SymSpell() {
+    this(SymSpellConfig.defaultConfig());
+  }
+
+  // ------------------------------------------------------------------
+  // Build hooks (fed by the persistence layer; no I/O happens here).
+  // ------------------------------------------------------------------
+
+  /**
+   * Adds (or accumulates) a dictionary term and its count, updating the deletes index.
+   *
+   * <p>If the term already exists, {@code count} is added to the existing count. Terms
+   * whose accumulated count stays below the configured {@code countThreshold} are tracked
+   * but not indexed until they reach the threshold.</p>
+   *
+   * @param word  the dictionary term; must not be {@code null}
+   * @param count the corpus count to add; must be {@code >= 0}
+   * @return {@code true} if the term became (or remained) indexed
+   */
+  public boolean add(String word, long count) {
+    Objects.requireNonNull(word, "word must not be null");
+    if (count < 0) {
+      throw new IllegalArgumentException("count must not be negative: " + count);
+    }
+    // Every occurrence (even of sub-threshold terms) contributes to the corpus size N.
+    totalCorpusCount = saturatedAdd(totalCorpusCount, count);
+    if (count == 0 && countThreshold > 0) {
+      return false;
+    }
+
+    final Long previous = words.get(word);
+    long newCount;
+    if (previous != null) {
+      newCount = saturatedAdd(previous, count);
+      words.put(word, newCount);
+      // Already indexed previously if it had cleared the threshold.
+      if (previous >= countThreshold) {
+        return true;
+      }
+      if (newCount < countThreshold) {
+        return false;
+      }
+    } else {
+      newCount = count;
+      words.put(word, newCount);
+      if (newCount < countThreshold) {
+        return false;
+      }
+    }
+
+    final int wordLen = word.codePointCount(0, word.length());
+    if (wordLen > maxLength) {
+      maxLength = wordLen;
+    }
+
+    for (String delete : editsPrefix(word)) {
+      final String[] existing = deletes.get(delete);
+      if (existing == null) {
+        deletes.put(delete, new String[] {word});
+      } else {
+        final String[] grown = new String[existing.length + 1];
+        System.arraycopy(existing, 0, grown, 0, existing.length);
+        grown[existing.length] = word;
+        deletes.put(delete, grown);
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Adds (or accumulates) a bigram and its count for compound correction.
+   *
+   * @param w1    the first word; must not be {@code null}
+   * @param w2    the second word; must not be {@code null}
+   * @param count the corpus count to add; must be {@code >= 0}
+   */
+  public void addBigram(String w1, String w2, long count) {
+    Objects.requireNonNull(w1, "w1 must not be null");
+    Objects.requireNonNull(w2, "w2 must not be null");
+    if (count < 0) {
+      throw new IllegalArgumentException("count must not be negative: " + count);

Review Comment:
   Add Javadoc detailing `@throws` for IllegalArgumentException.
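
   Since this method also null-checks both words, the Javadoc could declare both exception types. A sketch under assumptions: `BigramSketch` is a hypothetical stand-in, and the accumulation step is simplified to a plain sum because the rest of the method body is not visible in this hunk:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for the bigram bookkeeping; not the PR's implementation. */
final class BigramSketch {

  /** "w1 w2" -> accumulated count. */
  private final Map<String, Long> bigrams = new HashMap<>();

  /**
   * Adds (or accumulates) a bigram and its count.
   *
   * @param w1 the first word; must not be {@code null}
   * @param w2 the second word; must not be {@code null}
   * @param count the corpus count to add; must be {@code >= 0}
   * @throws NullPointerException if {@code w1} or {@code w2} is {@code null}
   * @throws IllegalArgumentException if {@code count} is negative
   */
  void addBigram(String w1, String w2, long count) {
    Objects.requireNonNull(w1, "w1 must not be null");
    Objects.requireNonNull(w2, "w2 must not be null");
    if (count < 0) {
      throw new IllegalArgumentException("count must not be negative: " + count);
    }
    // Assumption: plain addition; the engine itself may guard against overflow.
    bigrams.merge(w1 + " " + w2, count, Long::sum);
  }

  /** Returns the accumulated count for the bigram, or 0 if it was never added. */
  long count(String w1, String w2) {
    return bigrams.getOrDefault(w1 + " " + w2, 0L);
  }
}
```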



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpellConfig.java:
##########
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.Objects;
+
+import opennlp.spellcheck.distance.DamerauOSADistance;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Immutable configuration for {@link SymSpell}, created through {@link 
#builder()}.
+ *
+ * <p>The tunables mirror the SymSpell reference implementation:</p>
+ * <ul>
+ *   <li><b>maxDictionaryEditDistance</b> &ndash; the largest edit distance 
for which the
+ *       deletes index is precomputed; queries cannot exceed this.</li>
+ *   <li><b>prefixLength</b> &ndash; only the first {@code prefixLength} 
symbols of each
+ *       term are used to generate deletes, trading index size for recall on 
long words.</li>
+ *   <li><b>countThreshold</b> &ndash; the minimum corpus count for a term to 
be indexed.</li>
+ *   <li><b>editDistance</b> &ndash; the verification metric, defaulting to
+ *       {@link DamerauOSADistance}.</li>
+ *   <li><b>corpusWordCount</b> &ndash; the corpus normalization constant 
<i>N</i> used by
+ *       the Naive-Bayes word combine/split scoring in
+ *       {@link SymSpell#lookupCompound(String, int)}. Defaults to
+ *       {@link #DERIVE_CORPUS_WORD_COUNT}, which makes the engine derive 
<i>N</i> from the
+ *       summed counts of the loaded dictionary so it is always 
corpus-correct; set it
+ *       explicitly to pin <i>N</i> (e.g. to the full-corpus total a reference 
dictionary was
+ *       drawn from).</li>
+ * </ul>
+ */
+public final class SymSpellConfig {
+
+  /**
+   * Sentinel for {@link #corpusWordCount()} meaning "derive <i>N</i> from the 
summed counts
+   * of the loaded dictionary" rather than pinning it to a fixed value.
+   */
+  public static final long DERIVE_CORPUS_WORD_COUNT = 0L;
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+  private final long corpusWordCount;
+
+  private SymSpellConfig(Builder b) {
+    this.maxDictionaryEditDistance = b.maxDictionaryEditDistance;
+    this.prefixLength = b.prefixLength;
+    this.countThreshold = b.countThreshold;
+    this.editDistance = b.editDistance;
+    this.corpusWordCount = b.corpusWordCount;
+  }
+
+  public int maxDictionaryEditDistance() {
+    return maxDictionaryEditDistance;
+  }
+
+  public int prefixLength() {
+    return prefixLength;
+  }
+
+  public long countThreshold() {
+    return countThreshold;
+  }
+
+  public EditDistance editDistance() {
+    return editDistance;
+  }
+
+  /**
+   * @return the pinned corpus normalization constant <i>N</i>, or
+   *     {@link #DERIVE_CORPUS_WORD_COUNT} when <i>N</i> is derived from the 
loaded
+   *     dictionary's summed counts.
+   */
+  public long corpusWordCount() {
+    return corpusWordCount;
+  }
+
+  /**
+   * @return a builder with the SymSpell reference defaults
+   *     (maxDictionaryEditDistance=2, prefixLength=7, countThreshold=1,
+   *     editDistance={@link DamerauOSADistance}).
+   */
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /** @return a configuration with all reference defaults. */
+  public static SymSpellConfig defaultConfig() {
+    return builder().build();
+  }
+
+  /** Mutable builder for {@link SymSpellConfig}. */
+  public static final class Builder {
+
+    private int maxDictionaryEditDistance = 2;
+    private int prefixLength = 7;
+    private long countThreshold = 1;
+    private EditDistance editDistance = DamerauOSADistance.INSTANCE;
+    private long corpusWordCount = DERIVE_CORPUS_WORD_COUNT;
+
+    private Builder() {
+    }
+
+    /**
+     * @param value largest precomputed dictionary edit distance; must be 
{@code >= 0}
+     * @return this builder
+     */
+    public Builder maxDictionaryEditDistance(int value) {
+      if (value < 0) {
+        throw new IllegalArgumentException("maxDictionaryEditDistance must be 
>= 0: " + value);
+      }
+      this.maxDictionaryEditDistance = value;
+      return this;
+    }
+
+    /**
+     * @param value number of leading symbols used for delete generation; must 
be
+     *     {@code >= 1} and {@code > maxDictionaryEditDistance}
+     * @return this builder
+     */
+    public Builder prefixLength(int value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("prefixLength must be >= 1: " + 
value);

Review Comment:
   Add Javadoc detailing `@throws` for IllegalArgumentException.
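
   A minimal sketch of what this could look like, assuming the setter keeps the
   `value < 1` check shown in the hunk above. The class name
   `PrefixLengthBuilderSketch` is hypothetical and only stands in for
   `SymSpellConfig.Builder`; everything except `prefixLength` is omitted.

```java
/** Hypothetical stand-in for SymSpellConfig.Builder; only prefixLength is shown. */
final class PrefixLengthBuilderSketch {

  private int prefixLength = 7;

  /**
   * @param value number of leading symbols used for delete generation; must be
   *     {@code >= 1}
   * @return this builder
   * @throws IllegalArgumentException if {@code value} is less than {@code 1}
   */
  PrefixLengthBuilderSketch prefixLength(int value) {
    if (value < 1) {
      throw new IllegalArgumentException("prefixLength must be >= 1: " + value);
    }
    this.prefixLength = value;
    return this;
  }

  /** @return the currently configured prefix length */
  int prefixLength() {
    return prefixLength;
  }
}
```

   The existing Javadoc also requires `> maxDictionaryEditDistance`. If that
   constraint is enforced elsewhere (for example in `build()`, which is not
   visible in this hunk), that method would need its own `@throws` entry.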



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/symspell/SymSpellConfig.java:
##########
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck.symspell;
+
+import java.util.Objects;
+
+import opennlp.spellcheck.distance.DamerauOSADistance;
+import opennlp.spellcheck.distance.EditDistance;
+
+/**
+ * Immutable configuration for {@link SymSpell}, created through {@link 
#builder()}.
+ *
+ * <p>The tunables mirror the SymSpell reference implementation:</p>
+ * <ul>
+ *   <li><b>maxDictionaryEditDistance</b> &ndash; the largest edit distance 
for which the
+ *       deletes index is precomputed; queries cannot exceed this.</li>
+ *   <li><b>prefixLength</b> &ndash; only the first {@code prefixLength} 
symbols of each
+ *       term are used to generate deletes, trading index size for recall on 
long words.</li>
+ *   <li><b>countThreshold</b> &ndash; the minimum corpus count for a term to 
be indexed.</li>
+ *   <li><b>editDistance</b> &ndash; the verification metric, defaulting to
+ *       {@link DamerauOSADistance}.</li>
+ *   <li><b>corpusWordCount</b> &ndash; the corpus normalization constant 
<i>N</i> used by
+ *       the Naive-Bayes word combine/split scoring in
+ *       {@link SymSpell#lookupCompound(String, int)}. Defaults to
+ *       {@link #DERIVE_CORPUS_WORD_COUNT}, which makes the engine derive 
<i>N</i> from the
+ *       summed counts of the loaded dictionary so it is always 
corpus-correct; set it
+ *       explicitly to pin <i>N</i> (e.g. to the full-corpus total a reference 
dictionary was
+ *       drawn from).</li>
+ * </ul>
+ */
+public final class SymSpellConfig {
+
+  /**
+   * Sentinel for {@link #corpusWordCount()} meaning "derive <i>N</i> from the 
summed counts
+   * of the loaded dictionary" rather than pinning it to a fixed value.
+   */
+  public static final long DERIVE_CORPUS_WORD_COUNT = 0L;
+
+  private final int maxDictionaryEditDistance;
+  private final int prefixLength;
+  private final long countThreshold;
+  private final EditDistance editDistance;
+  private final long corpusWordCount;
+
+  private SymSpellConfig(Builder b) {
+    this.maxDictionaryEditDistance = b.maxDictionaryEditDistance;
+    this.prefixLength = b.prefixLength;
+    this.countThreshold = b.countThreshold;
+    this.editDistance = b.editDistance;
+    this.corpusWordCount = b.corpusWordCount;
+  }
+
+  public int maxDictionaryEditDistance() {
+    return maxDictionaryEditDistance;
+  }
+
+  public int prefixLength() {
+    return prefixLength;
+  }
+
+  public long countThreshold() {
+    return countThreshold;
+  }
+
+  public EditDistance editDistance() {
+    return editDistance;
+  }
+
+  /**
+   * @return the pinned corpus normalization constant <i>N</i>, or
+   *     {@link #DERIVE_CORPUS_WORD_COUNT} when <i>N</i> is derived from the 
loaded
+   *     dictionary's summed counts.
+   */
+  public long corpusWordCount() {
+    return corpusWordCount;
+  }
+
+  /**
+   * @return a builder with the SymSpell reference defaults
+   *     (maxDictionaryEditDistance=2, prefixLength=7, countThreshold=1,
+   *     editDistance={@link DamerauOSADistance}).
+   */
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  /** @return a configuration with all reference defaults. */
+  public static SymSpellConfig defaultConfig() {
+    return builder().build();
+  }
+
+  /** Mutable builder for {@link SymSpellConfig}. */
+  public static final class Builder {
+
+    private int maxDictionaryEditDistance = 2;
+    private int prefixLength = 7;
+    private long countThreshold = 1;
+    private EditDistance editDistance = DamerauOSADistance.INSTANCE;
+    private long corpusWordCount = DERIVE_CORPUS_WORD_COUNT;
+
+    private Builder() {
+    }
+
+    /**
+     * @param value largest precomputed dictionary edit distance; must be 
{@code >= 0}
+     * @return this builder
+     */
+    public Builder maxDictionaryEditDistance(int value) {
+      if (value < 0) {
+        throw new IllegalArgumentException("maxDictionaryEditDistance must be 
>= 0: " + value);
+      }
+      this.maxDictionaryEditDistance = value;
+      return this;
+    }
+
+    /**
+     * @param value number of leading symbols used for delete generation; must 
be
+     *     {@code >= 1} and {@code > maxDictionaryEditDistance}
+     * @return this builder
+     */
+    public Builder prefixLength(int value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("prefixLength must be >= 1: " + 
value);
+      }
+      this.prefixLength = value;
+      return this;
+    }
+
+    /**
+     * @param value minimum corpus count for a term to be indexed; must be 
{@code >= 1}
+     * @return this builder
+     */
+    public Builder countThreshold(long value) {
+      if (value < 1) {
+        throw new IllegalArgumentException("countThreshold must be >= 1: " + 
value);
+      }
+      this.countThreshold = value;
+      return this;
+    }
+
+    /**
+     * @param value verification metric to inject; must not be {@code null}
+     * @return this builder
+     */
+    public Builder editDistance(EditDistance value) {
+      this.editDistance = Objects.requireNonNull(value, "editDistance must not 
be null");
+      return this;
+    }
+
+    /**
+     * Pins the corpus normalization constant <i>N</i> used by the Naive-Bayes 
word
+     * combine/split scoring in {@link SymSpell#lookupCompound(String, int)}.
+     *
+     * @param value the corpus word count to pin, or {@link 
#DERIVE_CORPUS_WORD_COUNT} to
+     *              derive <i>N</i> from the loaded dictionary's summed 
counts; must be
+     *              {@code >= 0}
+     * @return this builder
+     */
+    public Builder corpusWordCount(long value) {
+      if (value < 0) {
+        throw new IllegalArgumentException("corpusWordCount must be >= 0: " + 
value);

Review Comment:
   Add Javadoc detailing `@throws` for IllegalArgumentException.



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/SpellChecker.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck;
+
+import java.util.List;
+
+/**
+ * A spelling corrector that proposes {@link SuggestItem suggestions} for 
individual
+ * terms and corrects whole sentences.
+ *
+ * <p>Implementations are expected to be thread-safe for concurrent {@code 
lookup} calls
+ * once their dictionary has been fully populated <i>and safely published</i> 
to the reading
+ * threads (e.g. stored in a {@code final} field, or otherwise guarded so a 
happens-before
+ * edge exists between population and the first read). Population itself is 
not required to
+ * be thread-safe.</p>
+ */
+public interface SpellChecker {
+
+  /**
+   * Looks up suggestions for a single {@code term} within {@code 
maxEditDistance}.
+   *
+   * @param term           the (possibly misspelled) term to correct; must not 
be {@code null}
+   * @param verbosity      controls how many suggestions are returned
+   * @param maxEditDistance the maximum edit distance to consider; must not be 
negative and
+   *                       must not exceed {@link #maxEditDistance()}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term, Verbosity verbosity, int 
maxEditDistance);
+
+  /**
+   * @return the largest edit distance this checker can answer queries for 
(the configured
+   *     maximum dictionary edit distance); a {@code maxEditDistance} argument 
to
+   *     {@link #lookup(String, Verbosity, int)} must not exceed this value.
+   */
+  int maxEditDistance();
+
+  /**
+   * Convenience overload that uses {@link Verbosity#TOP} and the 
implementation's
+   * configured maximum dictionary edit distance.
+   *
+   * @param term the (possibly misspelled) term to correct; must not be {@code 
null}

Review Comment:
   What if `term` is a blank String instance?
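
   One way to pin this down, sketched under the assumption that a blank term
   should yield an empty result rather than an exception (the PR may well decide
   otherwise). `BlankTermSketch` and its `lookup` helper are hypothetical and
   return plain strings instead of `SuggestItem`.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration of a documented blank-input contract. */
final class BlankTermSketch {

  /**
   * @param term the term to correct; must not be {@code null}
   * @return the suggestions, or an empty list if {@code term} is blank
   *     (empty or whitespace only); never {@code null}
   * @throws NullPointerException if {@code term} is {@code null}
   */
  static List<String> lookup(String term) {
    Objects.requireNonNull(term, "term must not be null");
    if (term.isBlank()) {
      return List.of(); // assumed contract: nothing to correct
    }
    // Placeholder for the real dictionary lookup: echo the input back.
    return List.of(term);
  }
}
```

   Whichever behavior is chosen, stating it in the `@param` or `@return` text
   would answer the question for callers.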



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/SpellChecker.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck;
+
+import java.util.List;
+
+/**
+ * A spelling corrector that proposes {@link SuggestItem suggestions} for 
individual
+ * terms and corrects whole sentences.
+ *
+ * <p>Implementations are expected to be thread-safe for concurrent {@code 
lookup} calls
+ * once their dictionary has been fully populated <i>and safely published</i> 
to the reading
+ * threads (e.g. stored in a {@code final} field, or otherwise guarded so a 
happens-before
+ * edge exists between population and the first read). Population itself is 
not required to
+ * be thread-safe.</p>
+ */
+public interface SpellChecker {
+
+  /**
+   * Looks up suggestions for a single {@code term} within {@code 
maxEditDistance}.
+   *
+   * @param term           the (possibly misspelled) term to correct; must not 
be {@code null}
+   * @param verbosity      controls how many suggestions are returned
+   * @param maxEditDistance the maximum edit distance to consider; must not be 
negative and
+   *                       must not exceed {@link #maxEditDistance()}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term, Verbosity verbosity, int 
maxEditDistance);
+
+  /**
+   * @return the largest edit distance this checker can answer queries for 
(the configured
+   *     maximum dictionary edit distance); a {@code maxEditDistance} argument 
to
+   *     {@link #lookup(String, Verbosity, int)} must not exceed this value.
+   */
+  int maxEditDistance();
+
+  /**
+   * Convenience overload that uses {@link Verbosity#TOP} and the 
implementation's
+   * configured maximum dictionary edit distance.
+   *
+   * @param term the (possibly misspelled) term to correct; must not be {@code 
null}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term);
+
+  /**
+   * Corrects a whole input string (a phrase or sentence), supporting word 
splits and
+   * merges, and combining candidates using a bigram language model.
+   *
+   * @param input           the input phrase to correct; must not be {@code 
null}

Review Comment:
   What if `input` is a blank String instance?



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/package-info.java:
##########
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Native SymSpell-based spell-correction component for Apache OpenNLP
+ * (OPENNLP-1832).
+ *
+ * <p>This root package defines the public API: {@link 
opennlp.spellcheck.SpellChecker},
+ * the {@link opennlp.spellcheck.SuggestItem} result type and the
+ * {@link opennlp.spellcheck.Verbosity} lookup mode. The engine, edit-distance 
metrics,
+ * persistence, pipeline integration and command-line tools live in the 
subpackages.</p>
+ */
+package opennlp.spellcheck;

Review Comment:
   See previous comments on package-info.java files.



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/SpellChecker.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck;
+
+import java.util.List;
+
+/**
+ * A spelling corrector that proposes {@link SuggestItem suggestions} for 
individual
+ * terms and corrects whole sentences.
+ *
+ * <p>Implementations are expected to be thread-safe for concurrent {@code 
lookup} calls
+ * once their dictionary has been fully populated <i>and safely published</i> 
to the reading
+ * threads (e.g. stored in a {@code final} field, or otherwise guarded so a 
happens-before
+ * edge exists between population and the first read). Population itself is 
not required to
+ * be thread-safe.</p>
+ */
+public interface SpellChecker {
+
+  /**
+   * Looks up suggestions for a single {@code term} within {@code 
maxEditDistance}.
+   *
+   * @param term           the (possibly misspelled) term to correct; must not 
be {@code null}
+   * @param verbosity      controls how many suggestions are returned
+   * @param maxEditDistance the maximum edit distance to consider; must not be 
negative and
+   *                       must not exceed {@link #maxEditDistance()}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term, Verbosity verbosity, int 
maxEditDistance);
+
+  /**
+   * @return the largest edit distance this checker can answer queries for 
(the configured
+   *     maximum dictionary edit distance); a {@code maxEditDistance} argument 
to
+   *     {@link #lookup(String, Verbosity, int)} must not exceed this value.
+   */
+  int maxEditDistance();
+
+  /**
+   * Convenience overload that uses {@link Verbosity#TOP} and the 
implementation's
+   * configured maximum dictionary edit distance.
+   *
+   * @param term the (possibly misspelled) term to correct; must not be {@code 
null}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term);

Review Comment:
   What is thrown if the parameter expectations are not met?
IllegalArgumentException, I'd guess? Please add the appropriate `@throws` Javadoc
for this.
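
   A sketch of how the contract might read, assuming `null` arguments raise
   NullPointerException (as `Objects.requireNonNull` does in the builder) and
   out-of-range distances raise IllegalArgumentException. Both choices are
   assumptions, and `LookupContractSketch` is a hypothetical class that returns
   plain strings and omits the `Verbosity` parameter.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical checker used only to illustrate the {@code @throws} clauses. */
final class LookupContractSketch {

  private final int maxEditDistance;

  LookupContractSketch(int maxEditDistance) {
    this.maxEditDistance = maxEditDistance;
  }

  /**
   * @param term            the term to correct; must not be {@code null}
   * @param maxEditDistance the maximum edit distance to consider
   * @return the matching suggestions; never {@code null}
   * @throws NullPointerException     if {@code term} is {@code null}
   * @throws IllegalArgumentException if {@code maxEditDistance} is negative or
   *     exceeds the configured maximum
   */
  List<String> lookup(String term, int maxEditDistance) {
    Objects.requireNonNull(term, "term must not be null");
    if (maxEditDistance < 0 || maxEditDistance > this.maxEditDistance) {
      throw new IllegalArgumentException("maxEditDistance must be in [0, "
          + this.maxEditDistance + "]: " + maxEditDistance);
    }
    return List.of(); // placeholder: no dictionary in this sketch
  }
}
```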



##########
pom.xml:
##########
@@ -210,6 +210,18 @@
                                <version>${project.version}</version>
                        </dependency>
 
+                       <dependency>
+                               <artifactId>opennlp-spellcheck</artifactId>
+                               <groupId>${project.groupId}</groupId>
+                               <version>${project.version}</version>
+                       </dependency>
+
+                       <dependency>
+                               <groupId>org.apache.commons</groupId>
+                               <artifactId>commons-text</artifactId>

Review Comment:
   This dependency declaration is questionable; see the earlier code-related
comment on the LevenshteinDistance implementation.



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/SpellChecker.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck;
+
+import java.util.List;
+
+/**
+ * A spelling corrector that proposes {@link SuggestItem suggestions} for 
individual
+ * terms and corrects whole sentences.
+ *
+ * <p>Implementations are expected to be thread-safe for concurrent {@code 
lookup} calls
+ * once their dictionary has been fully populated <i>and safely published</i> 
to the reading
+ * threads (e.g. stored in a {@code final} field, or otherwise guarded so a 
happens-before
+ * edge exists between population and the first read). Population itself is 
not required to
+ * be thread-safe.</p>
+ */
+public interface SpellChecker {
+
+  /**
+   * Looks up suggestions for a single {@code term} within {@code 
maxEditDistance}.
+   *
+   * @param term           the (possibly misspelled) term to correct; must not 
be {@code null}
+   * @param verbosity      controls how many suggestions are returned
+   * @param maxEditDistance the maximum edit distance to consider; must not be 
negative and
+   *                       must not exceed {@link #maxEditDistance()}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term, Verbosity verbosity, int 
maxEditDistance);
+
+  /**
+   * @return the largest edit distance this checker can answer queries for 
(the configured
+   *     maximum dictionary edit distance); a {@code maxEditDistance} argument 
to
+   *     {@link #lookup(String, Verbosity, int)} must not exceed this value.
+   */
+  int maxEditDistance();
+
+  /**
+   * Convenience overload that uses {@link Verbosity#TOP} and the 
implementation's
+   * configured maximum dictionary edit distance.
+   *
+   * @param term the (possibly misspelled) term to correct; must not be {@code 
null}
+   * @return the matching suggestions in natural order (best first); never 
{@code null}
+   */
+  List<SuggestItem> lookup(String term);
+
+  /**
+   * Corrects a whole input string (a phrase or sentence), supporting word 
splits and
+   * merges, and combining candidates using a bigram language model.
+   *
+   * @param input           the input phrase to correct; must not be {@code 
null}
+   * @param maxEditDistance the maximum edit distance per token; must not be 
negative
+   * @return a singleton list holding the best correction of the whole input; 
never {@code null}
+   */
+  List<SuggestItem> lookupCompound(String input, int maxEditDistance);

Review Comment:
   What is thrown if the parameter expectations are not met?
IllegalArgumentException, I'd guess? Please add the appropriate `@throws` Javadoc
for this.



##########
opennlp-extensions/opennlp-spellcheck/src/main/java/opennlp/spellcheck/Verbosity.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.spellcheck;
+
+/**
+ * Controls how many suggestions a {@link SpellChecker#lookup} call returns and
+ * with how much effort they are gathered.
+ *
+ * <p>The semantics follow the SymSpell reference implementation.</p>

Review Comment:
   Please add an external reference (link) here to the official website of the
"SymSpell reference implementation".
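
   For illustration, a sketch of how such a link could be embedded in the
   Javadoc. It assumes the intended target is the upstream SymSpell repository
   on GitHub. `VerbositySketch` is a hypothetical name, and of its constants
   only `TOP` appears in this PR's hunks; `CLOSEST` and `ALL` are assumed to
   mirror the reference implementation.

```java
/**
 * Controls how many suggestions a lookup call returns.
 *
 * <p>The semantics follow the
 * <a href="https://github.com/wolfgarbe/SymSpell">SymSpell reference
 * implementation</a>.</p>
 *
 * @see <a href="https://github.com/wolfgarbe/SymSpell">SymSpell on GitHub</a>
 */
enum VerbositySketch {
  /** Assumed: only the single best suggestion. */
  TOP,
  /** Assumed: all suggestions at the smallest edit distance found. */
  CLOSEST,
  /** Assumed: all suggestions within the maximum edit distance. */
  ALL
}
```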



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

