7) (opennlp)

via GitHub Wed, 01 Jul 2026 23:05:08 -0700


mawiesne commented on code in PR #1109:
URL: https://github.com/apache/opennlp/pull/1109#discussion_r3510785265



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
     return out.toString();
   }
 
+  /**
+   * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text.
+   *
+   * @param text The text to normalize.
+   * @return The normalized text and its alignment.
+   */
+  public AlignedText normalizeAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      final int charCount = Character.charCount(codePoint);
+      if (members.contains(codePoint)) {
+        out.appendCodePoint(replacement);
+        alignment.replace(charCount, Character.charCount(replacement));
+      } else {
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+      }
+      i += charCount;
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapse(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text. Each collapsed run maps to the run's whole original extent.
+   *
+   * @param text The text to collapse.
+   * @return The collapsed text and its alignment.
+   */
+  public AlignedText collapseAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");

Review Comment:
   Please throw an IllegalArgumentException instead of requireNonNull's NullPointerException.
   Add proper documentation for the IllegalArgumentException to the Javadoc here.
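
A minimal sketch of the requested pattern, not the PR's actual code: the class `NullCheckSketch` and the method `process` are hypothetical stand-ins for `collapseAligned`, and the body only echoes the input. It shows an explicit null check that throws `IllegalArgumentException`, together with the matching `@throws` tag.

```java
public final class NullCheckSketch {

  private NullCheckSketch() {
  }

  /**
   * Validates the input and returns it as a string (hypothetical stand-in for the real work).
   *
   * @param text The text to process. Must not be {@code null}.
   * @return The processed text.
   * @throws IllegalArgumentException Thrown if {@code text} is {@code null}.
   */
  public static String process(CharSequence text) {
    // Explicit check instead of Objects.requireNonNull, which would throw NullPointerException.
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    return text.toString();
  }
}
```

With this shape, a caller passing `null` gets an `IllegalArgumentException` whose message names the offending parameter.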



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/Alignment.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.Arrays;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A bidirectional alignment between an original text and a normalized form of it.
+ *
+ * <p>Normalization edits text in ways that move character offsets: a run of whitespace collapses to
+ * one space, a supplementary dash folds to a single ASCII hyphen, a case fold can grow text
+ * (German {@code eszett} to {@code ss}), and trimming or stripping deletes characters outright. An
+ * {@code Alignment} records those edits as a sequence of <em>equal</em> runs (text copied through
+ * unchanged in length) and <em>replace</em> runs (a block of original characters that produced a
+ * block of normalized characters), so any span in either form can be mapped to the other.</p>
+ *
+ * <p>Because it represents deletions as gaps and expansions as shared blocks (rather than storing a
+ * single original offset per normalized character, which would assume the normalized text
+ * contiguously covers the original), mapping is done
+ * span to span ({@link #toOriginalSpan(int, int)} / {@link #toNormalizedSpan(int, int)}) so a match
+ * that ends next to deleted text reports a tight span rather than over-covering the deletion. Two
+ * alignments compose with {@link #andThen(Alignment)}, which is what lets a multi-stage
+ * normalization pipeline still map a result all the way back to the original.</p>
+ *
+ * <p>Instances are immutable and thread-safe; build one with {@link Builder}.</p>

Review Comment:
   Please add the annotation `@ThreadSafe` to the class here.
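
A rough sketch of what annotating an immutable class could look like. The nested `ThreadSafe` annotation declared here is a local stand-in so the example compiles on its own; the real change would presumably import the project's existing marker annotation instead. `ImmutableOffsets` is a hypothetical class, not part of the PR.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class ThreadSafeSketch {

  /** Local stand-in for a project-level marker annotation (assumption, see lead-in). */
  @Documented
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  public @interface ThreadSafe {
  }

  /** Immutable after construction, so instances can be shared between threads. */
  @ThreadSafe
  public static final class ImmutableOffsets {

    private final int[] starts;

    public ImmutableOffsets(int[] starts) {
      // Defensive copy: later changes to the caller's array cannot leak in.
      this.starts = starts.clone();
    }

    public int length() {
      return starts.length;
    }

    public int startAt(int index) {
      return starts[index];
    }
  }
}
```

The annotation only documents the guarantee; the guarantee itself comes from the final fields and the defensive copy.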



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/Alignment.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.Arrays;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A bidirectional alignment between an original text and a normalized form of 
it.
+ *
+ * <p>Normalization edits text in ways that move character offsets: a run of 
whitespace collapses to
+ * one space, a supplementary dash folds to a single ASCII hyphen, a case fold 
can grow text
+ * (German {@code eszett} to {@code ss}), and trimming or stripping deletes 
characters outright. An
+ * {@code Alignment} records those edits as a sequence of <em>equal</em> runs 
(text copied through
+ * unchanged in length) and <em>replace</em> runs (a block of original 
characters that produced a
+ * block of normalized characters), so any span in either form can be mapped 
to the other.</p>
+ *
+ * <p>Because it represents deletions as gaps and expansions as shared blocks 
(rather than storing a
+ * single original offset per normalized character, which would assume the 
normalized text
+ * contiguously covers the original), mapping is done
+ * span to span ({@link #toOriginalSpan(int, int)} / {@link 
#toNormalizedSpan(int, int)}) so a match
+ * that ends next to deleted text reports a tight span rather than 
over-covering the deletion. Two
+ * alignments compose with {@link #andThen(Alignment)}, which is what lets a 
multi-stage
+ * normalization pipeline still map a result all the way back to the 
original.</p>
+ *
+ * <p>Instances are immutable and thread-safe; build one with {@link 
Builder}.</p>
+ */
+public final class Alignment {
+
+  // For normalized character k, originalStart[k]/originalEnd[k] are the 
half-open original range it
+  // was produced from. Characters copied unchanged map one to one; characters 
from a collapse or
+  // expansion share their run's whole original range (it cannot be 
subdivided); deleted original
+  // characters appear as a gap that no normalized character covers.
+  private final int[] originalStart;
+  private final int[] originalEnd;
+  private final int originalLength;
+
+  private Alignment(int[] originalStart, int[] originalEnd, int 
originalLength) {
+    this.originalStart = originalStart;
+    this.originalEnd = originalEnd;
+    this.originalLength = originalLength;
+  }
+
+  /** {@return the length of the normalized text this alignment was built for} 
*/
+  public int normalizedLength() {
+    return originalStart.length;
+  }
+
+  /** {@return the length of the original text this alignment was built for} */
+  public int originalLength() {
+    return originalLength;
+  }
+
+  /**
+   * Maps a half-open span of the normalized text to the tightest half-open 
span of the original
+   * text that produced it.
+   *
+   * @param normalizedStart The inclusive start offset, in {@code [0, 
normalizedLength()]}.
+   * @param normalizedEnd   The exclusive end offset, in {@code 
[normalizedStart, normalizedLength()]}.
+   * @return The corresponding original span.
+   * @throws IndexOutOfBoundsException Thrown if the offsets are out of range 
or inverted.
+   */
+  public Span toOriginalSpan(int normalizedStart, int normalizedEnd) {
+    checkRange(normalizedStart, normalizedEnd, normalizedLength());
+    if (normalizedStart == normalizedEnd) {
+      final int at = normalizedStart < normalizedLength()
+          ? originalStart[normalizedStart] : originalLength;
+      return new Span(at, at);
+    }
+    return new Span(originalStart[normalizedStart], originalEnd[normalizedEnd 
- 1]);
+  }
+
+  /**
+   * Maps a half-open span of the original text to the half-open span of the 
normalized text that
+   * covers it. Original characters that were deleted map to an empty span at 
the point where they
+   * were removed.
+   *
+   * @param originalStartOffset The inclusive start offset, in {@code [0, 
originalLength()]}.
+   * @param originalEndOffset   The exclusive end offset, in {@code 
[originalStartOffset, originalLength()]}.
+   * @return The corresponding normalized span.
+   * @throws IndexOutOfBoundsException Thrown if the offsets are out of range 
or inverted.
+   */
+  public Span toNormalizedSpan(int originalStartOffset, int originalEndOffset) 
{
+    checkRange(originalStartOffset, originalEndOffset, originalLength);
+    final int start = firstIndexEndingAfter(originalStartOffset);
+    final int end = firstIndexStartingAtOrAfter(originalEndOffset);
+    return new Span(start, Math.max(start, end));
+  }
+
+  /**
+   * Maps a normalized offset to the original offset where its character 
begins (start semantics).
+   * Prefer {@link #toOriginalSpan(int, int)} for mapping a match, since a 
single offset cannot
+   * distinguish the start and end of a span across a deletion.
+   *
+   * @param normalizedOffset An offset in {@code [0, normalizedLength()]}.
+   * @return The corresponding original offset.
+   * @throws IndexOutOfBoundsException Thrown if {@code normalizedOffset} is 
out of range.
+   */
+  public int toOriginalOffset(int normalizedOffset) {
+    if (normalizedOffset < 0 || normalizedOffset > normalizedLength()) {
+      throw new IndexOutOfBoundsException("normalized offset " + 
normalizedOffset
+          + " is outside [0, " + normalizedLength() + "]");
+    }
+    return normalizedOffset < normalizedLength() ? 
originalStart[normalizedOffset] : originalLength;
+  }
+
+  /**
+   * Composes this alignment with one that further normalizes this alignment's 
normalized text.
+   *
+   * <p>If this maps {@code original -> middle} and {@code next} maps {@code 
middle -> final}, the
+   * result maps {@code original -> final} directly, so a span found in the 
final text can be mapped
+   * straight back to the original without keeping the intermediate stages.</p>
+   *
+   * @param next The next stage, whose original side is this stage's 
normalized text.
+   * @return The composed alignment.
+   * @throws IllegalArgumentException Thrown if {@code next.originalLength()} 
does not equal this
+   *     {@code normalizedLength()} (the stages do not line up).
+   */
+  public Alignment andThen(Alignment next) {
+    if (next.originalLength != normalizedLength()) {
+      throw new IllegalArgumentException("stages do not line up: this 
normalizedLength="
+          + normalizedLength() + " but next originalLength=" + 
next.originalLength);
+    }
+    final int finalLength = next.normalizedLength();
+    final int[] starts = new int[finalLength];
+    final int[] ends = new int[finalLength];
+    for (int f = 0; f < finalLength; f++) {
+      final int middleStart = next.originalStart[f];
+      final int middleEnd = next.originalEnd[f];
+      final int start = middleStart < normalizedLength() ? 
originalStart[middleStart] : originalLength;
+      final int end = middleEnd > 0 ? originalEnd[middleEnd - 1] : 0;
+      starts[f] = start;
+      // Math.max keeps the original span non-inverted. When next inserted 
this final character
+      // (a zero-width middle range, middleStart == middleEnd) the max 
collapses it to a zero-width
+      // original span -- correct for every insertion except one landing 
strictly inside an
+      // expansion this stage produced, where the characters on either side 
share one atomic
+      // original block (originalEnd[middleEnd - 1] > 
originalStart[middleStart]) that has no
+      // interior offset to point at. There the insertion is attributed to 
that whole block, the
+      // only choice that keeps originalStart/originalEnd sorted so 
toOriginalSpan/toNormalizedSpan
+      // keep their O(log n) search; forcing it to zero-width would push 
originalEnd below its
+      // predecessor and corrupt the reverse mapping.
+      ends[f] = Math.max(start, end);
+    }
+    return new Alignment(starts, ends, originalLength);
+  }
+
+  // First normalized index whose original coverage ends strictly after offset 
(so it covers or
+  // follows offset); normalizedLength() when offset is at or past the last 
covered original char.
+  private int firstIndexEndingAfter(int offset) {
+    int low = 0;
+    int high = originalEnd.length;
+    while (low < high) {
+      final int mid = (low + high) >>> 1;
+      if (originalEnd[mid] > offset) {
+        high = mid;
+      } else {
+        low = mid + 1;
+      }
+    }
+    return low;
+  }
+
+  // First normalized index whose original coverage starts at or after offset.
+  private int firstIndexStartingAtOrAfter(int offset) {
+    int low = 0;
+    int high = originalStart.length;
+    while (low < high) {
+      final int mid = (low + high) >>> 1;
+      if (originalStart[mid] >= offset) {
+        high = mid;
+      } else {
+        low = mid + 1;
+      }
+    }
+    return low;
+  }
+
+  private static void checkRange(int start, int end, int length) {
+    if (start < 0 || end > length || start > end) {
+      throw new IndexOutOfBoundsException("span [" + start + ", " + end + ") 
is outside [0, "
+          + length + "]");
+    }
+  }
+
+  /**
+   * Builds an {@link Alignment} as the normalized text is produced, by 
recording each edit in order.
+   * Call {@link #equal(int)} for characters copied through unchanged and 
{@link #replace(int, int)}
+   * for a block that was rewritten (including deletions and insertions), then 
{@link #build(int)}.
+   */
+  public static final class Builder {
+
+    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+    private int[] starts = new int[16];
+    private int[] ends = new int[16];
+    private int count;
+    private int originalCursor;
+
+    /**
+     * Records {@code charCount} characters copied through unchanged (a one to 
one run).
+     *
+     * @param charCount The number of UTF-16 characters; must not be negative.
+     * @return This builder.
+     */
+    public Builder equal(int charCount) {
+      if (charCount < 0) {
+        throw new IllegalArgumentException("charCount must not be negative: " 
+ charCount);

Review Comment:
   Please document the IllegalArgumentException in the Javadoc here (`@throws`).
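
One possible shape for the amended Javadoc, shown on a simplified stand-in. `BuilderSketch` is hypothetical and only counts characters; the point is the added `@throws` line, whose wording is an assumption.

```java
public final class BuilderSketch {

  private int count;

  /**
   * Records {@code charCount} characters copied through unchanged (a one to one run).
   *
   * @param charCount The number of UTF-16 characters; must not be negative.
   * @return This builder.
   * @throws IllegalArgumentException Thrown if {@code charCount} is negative.
   */
  public BuilderSketch equal(int charCount) {
    if (charCount < 0) {
      throw new IllegalArgumentException("charCount must not be negative: " + charCount);
    }
    // Simplified bookkeeping: the real builder records one range per character.
    count += charCount;
    return this;
  }

  /** {@return the number of characters recorded so far} */
  public int count() {
    return count;
  }
}
```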



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
     return out.toString();
   }
 
+  /**
+   * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text.
+   *
+   * @param text The text to normalize.
+   * @return The normalized text and its alignment.
+   */
+  public AlignedText normalizeAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");

Review Comment:
   Please throw an IllegalArgumentException instead of requireNonNull's NullPointerException.
   Add proper documentation for the IllegalArgumentException to the Javadoc here.



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/Alignment.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.Arrays;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A bidirectional alignment between an original text and a normalized form of 
it.
+ *
+ * <p>Normalization edits text in ways that move character offsets: a run of 
whitespace collapses to
+ * one space, a supplementary dash folds to a single ASCII hyphen, a case fold 
can grow text
+ * (German {@code eszett} to {@code ss}), and trimming or stripping deletes 
characters outright. An
+ * {@code Alignment} records those edits as a sequence of <em>equal</em> runs 
(text copied through
+ * unchanged in length) and <em>replace</em> runs (a block of original 
characters that produced a
+ * block of normalized characters), so any span in either form can be mapped 
to the other.</p>
+ *
+ * <p>Because it represents deletions as gaps and expansions as shared blocks 
(rather than storing a
+ * single original offset per normalized character, which would assume the 
normalized text
+ * contiguously covers the original), mapping is done
+ * span to span ({@link #toOriginalSpan(int, int)} / {@link 
#toNormalizedSpan(int, int)}) so a match
+ * that ends next to deleted text reports a tight span rather than 
over-covering the deletion. Two
+ * alignments compose with {@link #andThen(Alignment)}, which is what lets a 
multi-stage
+ * normalization pipeline still map a result all the way back to the 
original.</p>
+ *
+ * <p>Instances are immutable and thread-safe; build one with {@link 
Builder}.</p>
+ */
+public final class Alignment {
+
+  // For normalized character k, originalStart[k]/originalEnd[k] are the 
half-open original range it
+  // was produced from. Characters copied unchanged map one to one; characters 
from a collapse or
+  // expansion share their run's whole original range (it cannot be 
subdivided); deleted original
+  // characters appear as a gap that no normalized character covers.
+  private final int[] originalStart;
+  private final int[] originalEnd;
+  private final int originalLength;
+
+  private Alignment(int[] originalStart, int[] originalEnd, int 
originalLength) {
+    this.originalStart = originalStart;
+    this.originalEnd = originalEnd;
+    this.originalLength = originalLength;
+  }
+
+  /** {@return the length of the normalized text this alignment was built for} 
*/
+  public int normalizedLength() {
+    return originalStart.length;
+  }
+
+  /** {@return the length of the original text this alignment was built for} */
+  public int originalLength() {
+    return originalLength;
+  }
+
+  /**
+   * Maps a half-open span of the normalized text to the tightest half-open 
span of the original
+   * text that produced it.
+   *
+   * @param normalizedStart The inclusive start offset, in {@code [0, 
normalizedLength()]}.
+   * @param normalizedEnd   The exclusive end offset, in {@code 
[normalizedStart, normalizedLength()]}.
+   * @return The corresponding original span.
+   * @throws IndexOutOfBoundsException Thrown if the offsets are out of range 
or inverted.
+   */
+  public Span toOriginalSpan(int normalizedStart, int normalizedEnd) {
+    checkRange(normalizedStart, normalizedEnd, normalizedLength());
+    if (normalizedStart == normalizedEnd) {
+      final int at = normalizedStart < normalizedLength()
+          ? originalStart[normalizedStart] : originalLength;
+      return new Span(at, at);
+    }
+    return new Span(originalStart[normalizedStart], originalEnd[normalizedEnd 
- 1]);
+  }
+
+  /**
+   * Maps a half-open span of the original text to the half-open span of the 
normalized text that
+   * covers it. Original characters that were deleted map to an empty span at 
the point where they
+   * were removed.
+   *
+   * @param originalStartOffset The inclusive start offset, in {@code [0, 
originalLength()]}.
+   * @param originalEndOffset   The exclusive end offset, in {@code 
[originalStartOffset, originalLength()]}.
+   * @return The corresponding normalized span.
+   * @throws IndexOutOfBoundsException Thrown if the offsets are out of range 
or inverted.
+   */
+  public Span toNormalizedSpan(int originalStartOffset, int originalEndOffset) 
{
+    checkRange(originalStartOffset, originalEndOffset, originalLength);
+    final int start = firstIndexEndingAfter(originalStartOffset);
+    final int end = firstIndexStartingAtOrAfter(originalEndOffset);
+    return new Span(start, Math.max(start, end));
+  }
+
+  /**
+   * Maps a normalized offset to the original offset where its character 
begins (start semantics).
+   * Prefer {@link #toOriginalSpan(int, int)} for mapping a match, since a 
single offset cannot
+   * distinguish the start and end of a span across a deletion.
+   *
+   * @param normalizedOffset An offset in {@code [0, normalizedLength()]}.
+   * @return The corresponding original offset.
+   * @throws IndexOutOfBoundsException Thrown if {@code normalizedOffset} is 
out of range.
+   */
+  public int toOriginalOffset(int normalizedOffset) {
+    if (normalizedOffset < 0 || normalizedOffset > normalizedLength()) {
+      throw new IndexOutOfBoundsException("normalized offset " + 
normalizedOffset
+          + " is outside [0, " + normalizedLength() + "]");
+    }
+    return normalizedOffset < normalizedLength() ? 
originalStart[normalizedOffset] : originalLength;
+  }
+
+  /**
+   * Composes this alignment with one that further normalizes this alignment's 
normalized text.
+   *
+   * <p>If this maps {@code original -> middle} and {@code next} maps {@code 
middle -> final}, the
+   * result maps {@code original -> final} directly, so a span found in the 
final text can be mapped
+   * straight back to the original without keeping the intermediate stages.</p>
+   *
+   * @param next The next stage, whose original side is this stage's 
normalized text.
+   * @return The composed alignment.
+   * @throws IllegalArgumentException Thrown if {@code next.originalLength()} 
does not equal this
+   *     {@code normalizedLength()} (the stages do not line up).
+   */
+  public Alignment andThen(Alignment next) {
+    if (next.originalLength != normalizedLength()) {
+      throw new IllegalArgumentException("stages do not line up: this 
normalizedLength="
+          + normalizedLength() + " but next originalLength=" + 
next.originalLength);
+    }
+    final int finalLength = next.normalizedLength();
+    final int[] starts = new int[finalLength];
+    final int[] ends = new int[finalLength];
+    for (int f = 0; f < finalLength; f++) {
+      final int middleStart = next.originalStart[f];
+      final int middleEnd = next.originalEnd[f];
+      final int start = middleStart < normalizedLength() ? 
originalStart[middleStart] : originalLength;
+      final int end = middleEnd > 0 ? originalEnd[middleEnd - 1] : 0;
+      starts[f] = start;
+      // Math.max keeps the original span non-inverted. When next inserted 
this final character
+      // (a zero-width middle range, middleStart == middleEnd) the max 
collapses it to a zero-width
+      // original span -- correct for every insertion except one landing 
strictly inside an
+      // expansion this stage produced, where the characters on either side 
share one atomic
+      // original block (originalEnd[middleEnd - 1] > 
originalStart[middleStart]) that has no
+      // interior offset to point at. There the insertion is attributed to 
that whole block, the
+      // only choice that keeps originalStart/originalEnd sorted so 
toOriginalSpan/toNormalizedSpan
+      // keep their O(log n) search; forcing it to zero-width would push 
originalEnd below its
+      // predecessor and corrupt the reverse mapping.
+      ends[f] = Math.max(start, end);
+    }
+    return new Alignment(starts, ends, originalLength);
+  }
+
+  // First normalized index whose original coverage ends strictly after offset 
(so it covers or
+  // follows offset); normalizedLength() when offset is at or past the last 
covered original char.
+  private int firstIndexEndingAfter(int offset) {
+    int low = 0;
+    int high = originalEnd.length;
+    while (low < high) {
+      final int mid = (low + high) >>> 1;
+      if (originalEnd[mid] > offset) {
+        high = mid;
+      } else {
+        low = mid + 1;
+      }
+    }
+    return low;
+  }
+
+  // First normalized index whose original coverage starts at or after offset.
+  private int firstIndexStartingAtOrAfter(int offset) {
+    int low = 0;
+    int high = originalStart.length;
+    while (low < high) {
+      final int mid = (low + high) >>> 1;
+      if (originalStart[mid] >= offset) {
+        high = mid;
+      } else {
+        low = mid + 1;
+      }
+    }
+    return low;
+  }
+
+  private static void checkRange(int start, int end, int length) {
+    if (start < 0 || end > length || start > end) {
+      throw new IndexOutOfBoundsException("span [" + start + ", " + end + ") 
is outside [0, "
+          + length + "]");
+    }
+  }
+
+  /**
+   * Builds an {@link Alignment} as the normalized text is produced, by 
recording each edit in order.
+   * Call {@link #equal(int)} for characters copied through unchanged and 
{@link #replace(int, int)}
+   * for a block that was rewritten (including deletions and insertions), then 
{@link #build(int)}.
+   */
+  public static final class Builder {
+
+    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+    private int[] starts = new int[16];
+    private int[] ends = new int[16];
+    private int count;
+    private int originalCursor;
+
+    /**
+     * Records {@code charCount} characters copied through unchanged (a one to 
one run).
+     *
+     * @param charCount The number of UTF-16 characters; must not be negative.
+     * @return This builder.
+     */
+    public Builder equal(int charCount) {
+      if (charCount < 0) {
+        throw new IllegalArgumentException("charCount must not be negative: " 
+ charCount);
+      }
+      for (int i = 0; i < charCount; i++) {
+        append(originalCursor, originalCursor + 1);
+        originalCursor++;
+      }
+      return this;
+    }
+
+    /**
+     * Records a rewritten block: {@code originalCount} original characters 
that produced
+     * {@code normalizedCount} normalized characters. Each produced character 
is attributed to the
+     * whole original block, since a collapse or expansion cannot be 
subdivided. {@code 0} for
+     * {@code normalizedCount} is a deletion; {@code 0} for {@code 
originalCount} is an insertion.
+     *
+     * @param originalCount   The number of original characters consumed; must 
not be negative.
+     * @param normalizedCount The number of normalized characters produced; 
must not be negative.
+     * @return This builder.
+     */
+    public Builder replace(int originalCount, int normalizedCount) {
+      if (originalCount < 0 || normalizedCount < 0) {
+        throw new IllegalArgumentException("counts must not be negative: " + 
originalCount

Review Comment:
   Please document the IllegalArgumentException in the Javadoc here (`@throws`).



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
     return out.toString();
   }
 
+  /**
+   * Like {@link #normalize(CharSequence)} but also produces the {@link 
Alignment} back to the
+   * original text.
+   *
+   * @param text The text to normalize.
+   * @return The normalized text and its alignment.
+   */
+  public AlignedText normalizeAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      final int charCount = Character.charCount(codePoint);
+      if (members.contains(codePoint)) {
+        out.appendCodePoint(replacement);
+        alignment.replace(charCount, Character.charCount(replacement));
+      } else {
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+      }
+      i += charCount;
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapse(CharSequence)} but also produces the {@link 
Alignment} back to the
+   * original text. Each collapsed run maps to the run's whole original extent.
+   *
+   * @param text The text to collapse.
+   * @return The collapsed text and its alignment.
+   */
+  public AlignedText collapseAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      if (members.contains(codePoint)) {
+        final int runEnd = skipRun(text, i);
+        out.appendCodePoint(replacement);
+        alignment.replace(runEnd - i, Character.charCount(replacement));
+        i = runEnd;
+      } else {
+        final int charCount = Character.charCount(codePoint);
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+        i += charCount;
+      }
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapsePreserving(CharSequence, CodePointSet, int)} but 
also produces the
+   * {@link Alignment} back to the original text.
+   *
+   * @param text The text to collapse.
+   * @param keep The member code points whose presence in a run preserves 
structure.
+   * @param keepReplacement The replacement emitted for a run that contains a 
{@code keep} member.
+   * @return The collapsed text and its alignment.
+   * @throws IllegalArgumentException Thrown if {@code keepReplacement} is not 
a valid code point.
+   */
+  public AlignedText collapsePreservingAligned(CharSequence text, CodePointSet 
keep,
+                                               int keepReplacement) {
+    Objects.requireNonNull(text, "text");
+    Objects.requireNonNull(keep, "keep");
+    requireValidCodePoint(keepReplacement);
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      if (members.contains(codePoint)) {
+        boolean preserve = keep.contains(codePoint);
+        int j = i + Character.charCount(codePoint);
+        while (j < length) {
+          final int next = Character.codePointAt(text, j);
+          if (!members.contains(next)) {
+            break;
+          }
+          preserve |= keep.contains(next);
+          j += Character.charCount(next);
+        }
+        final int emitted = preserve ? keepReplacement : replacement;
+        out.appendCodePoint(emitted);
+        alignment.replace(j - i, Character.charCount(emitted));
+        i = j;
+      } else {
+        final int charCount = Character.charCount(codePoint);
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+        i += charCount;
+      }
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #trim(CharSequence)} but also produces the {@link Alignment} back to the original
+   * text. The trimmed leading and trailing members appear as deletions, so a span never reports
+   * through them.
+   *
+   * @param text The text to trim.
+   * @return The trimmed text and its alignment.
+   */
+  public AlignedText trimAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");

Review Comment:
   Please throw IllegalArgumentException instead of requireNonNull's
NullPointerException.
   Adjust the Javadoc here to document the IllegalArgumentException.
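
   A minimal sketch of what the requested change could look like. The class
   `ArgChecks` and the method `requireArg` are hypothetical names chosen for
   illustration; they are not part of the PR or of OpenNLP.

```java
/** Hypothetical helper (not part of the PR): a null guard that throws IllegalArgumentException. */
final class ArgChecks {

  private ArgChecks() {
  }

  /**
   * @param value The argument to check.
   * @param name  The parameter name used in the exception message.
   * @return The {@code value}, if it is not {@code null}.
   * @throws IllegalArgumentException Thrown if {@code value} is {@code null}.
   */
  static <T> T requireArg(T value, String name) {
    if (value == null) {
      // IllegalArgumentException instead of the NullPointerException
      // that Objects.requireNonNull would raise.
      throw new IllegalArgumentException(name + " must not be null");
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(requireArg("abc", "text"));
    try {
      requireArg(null, "text");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

   Under this assumption, `trimAligned` would start with
   `requireArg(text, "text")` and gain a matching `@throws` entry.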



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -329,6 +498,37 @@ public static String substitute(CharSequence text, IntFunction<String> substitut
     return out.toString();
   }
 
+  /**
+   * Like {@link #substitute(CharSequence, IntFunction)} but also produces the {@link Alignment} back
+   * to the original text. Each replaced code point maps to its replacement string as one block.
+   *
+   * @param text         The text to transform.
+   * @param substitution The replacement for a code point, or {@code null} to copy it through.
+   * @return The transformed text and its alignment.
+   */
+  public static AlignedText substituteAligned(CharSequence text, IntFunction<String> substitution) {
+    Objects.requireNonNull(text, "text");

Review Comment:
   Please throw IllegalArgumentException instead of requireNonNull's
NullPointerException.
   Same for the second parameter, `substitution`.
   Add a Javadoc entry for IllegalArgumentException here. It should be generic,
covering any parameter violation.
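
   To illustrate the generic `@throws` wording, here is a simplified,
   self-contained sketch. It is an assumption-laden stand-in: the class
   `SubstituteSketch` is hypothetical, and the method returns a plain String
   rather than the PR's `AlignedText`, so no alignment is built.

```java
import java.util.function.IntFunction;

/** Hypothetical stand-in for the method under review; it omits the alignment bookkeeping. */
final class SubstituteSketch {

  private SubstituteSketch() {
  }

  /**
   * @param text         The text to transform.
   * @param substitution The replacement for a code point, or {@code null} to copy it through.
   * @return The transformed text.
   * @throws IllegalArgumentException Thrown if any parameter is invalid.
   */
  static String substitute(CharSequence text, IntFunction<String> substitution) {
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    if (substitution == null) {
      throw new IllegalArgumentException("substitution must not be null");
    }
    final StringBuilder out = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
      final int codePoint = Character.codePointAt(text, i);
      final String replacement = substitution.apply(codePoint);
      if (replacement == null) {
        out.appendCodePoint(codePoint); // no replacement: copy the code point through
      } else {
        out.append(replacement);        // replaced as one block
      }
      i += Character.charCount(codePoint);
    }
    return out.toString();
  }
}
```

   A single generic `@throws` line keeps the Javadoc stable if further
   parameter checks are added later.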



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
     return out.toString();
   }
 
+  /**
+   * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text.
+   *
+   * @param text The text to normalize.
+   * @return The normalized text and its alignment.
+   */
+  public AlignedText normalizeAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      final int charCount = Character.charCount(codePoint);
+      if (members.contains(codePoint)) {
+        out.appendCodePoint(replacement);
+        alignment.replace(charCount, Character.charCount(replacement));
+      } else {
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+      }
+      i += charCount;
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapse(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text. Each collapsed run maps to the run's whole original extent.
+   *
+   * @param text The text to collapse.
+   * @return The collapsed text and its alignment.
+   */
+  public AlignedText collapseAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      if (members.contains(codePoint)) {
+        final int runEnd = skipRun(text, i);
+        out.appendCodePoint(replacement);
+        alignment.replace(runEnd - i, Character.charCount(replacement));
+        i = runEnd;
+      } else {
+        final int charCount = Character.charCount(codePoint);
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+        i += charCount;
+      }
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapsePreserving(CharSequence, CodePointSet, int)} but also produces the
+   * {@link Alignment} back to the original text.
+   *
+   * @param text The text to collapse.
+   * @param keep The member code points whose presence in a run preserves structure.
+   * @param keepReplacement The replacement emitted for a run that contains a {@code keep} member.
+   * @return The collapsed text and its alignment.
+   * @throws IllegalArgumentException Thrown if {@code keepReplacement} is not a valid code point.
+   */
+  public AlignedText collapsePreservingAligned(CharSequence text, CodePointSet keep,
+                                               int keepReplacement) {
+    Objects.requireNonNull(text, "text");
+    Objects.requireNonNull(keep, "keep");
+    requireValidCodePoint(keepReplacement);
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      if (members.contains(codePoint)) {
+        boolean preserve = keep.contains(codePoint);
+        int j = i + Character.charCount(codePoint);
+        while (j < length) {
+          final int next = Character.codePointAt(text, j);
+          if (!members.contains(next)) {
+            break;
+          }
+          preserve |= keep.contains(next);
+          j += Character.charCount(next);
+        }
+        final int emitted = preserve ? keepReplacement : replacement;
+        out.appendCodePoint(emitted);
+        alignment.replace(j - i, Character.charCount(emitted));
+        i = j;
+      } else {
+        final int charCount = Character.charCount(codePoint);
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+        i += charCount;
+      }
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #trim(CharSequence)} but also produces the {@link Alignment} back to the original
+   * text. The trimmed leading and trailing members appear as deletions, so a span never reports
+   * through them.
+   *
+   * @param text The text to trim.
+   * @return The trimmed text and its alignment.
+   */
+  public AlignedText trimAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final int length = text.length();
+    int start = 0;
+    while (start < length) {
+      final int codePoint = Character.codePointAt(text, start);
+      if (!members.contains(codePoint)) {
+        break;
+      }
+      start += Character.charCount(codePoint);
+    }
+    int end = length;
+    while (end > start) {
+      final int codePoint = Character.codePointBefore(text, end);
+      if (!members.contains(codePoint)) {
+        break;
+      }
+      end -= Character.charCount(codePoint);
+    }
+    final Alignment.Builder alignment = new Alignment.Builder();
+    if (start > 0) {
+      alignment.replace(start, 0);
+    }
+    alignment.equal(end - start);
+    if (end < length) {
+      alignment.replace(length - end, 0);
+    }
+    return new AlignedText(text, text.subSequence(start, end).toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #removeAll(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text. Every removed member appears as a deletion, so a span never reports through one.
+   *
+   * @param text The text to filter.
+   * @return The filtered text and its alignment.
+   */
+  public AlignedText removeAllAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");

Review Comment:
   Please throw IllegalArgumentException instead of requireNonNull's
NullPointerException.
   Adjust the Javadoc here to document the IllegalArgumentException.
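
   For readers following the hunk above, the boundary scan that `trimAligned`
   performs can be looked at in isolation. This is only a sketch: the class
   `TrimBoundsSketch` is hypothetical, an `IntPredicate` stands in for the
   class's `members` set, and the PR's `Alignment.Builder` is left out.

```java
import java.util.function.IntPredicate;

/** Hypothetical illustration of the leading/trailing scan; not code from the PR. */
final class TrimBoundsSketch {

  private TrimBoundsSketch() {
  }

  /**
   * @param text     The text to scan.
   * @param isMember Stand-in for the member test of the character class.
   * @return A two-element array {start, end}: the half-open range that survives trimming.
   */
  static int[] bounds(CharSequence text, IntPredicate isMember) {
    final int length = text.length();
    int start = 0;
    while (start < length) {
      final int codePoint = Character.codePointAt(text, start);
      if (!isMember.test(codePoint)) {
        break;
      }
      start += Character.charCount(codePoint);
    }
    int end = length;
    while (end > start) {
      final int codePoint = Character.codePointBefore(text, end);
      if (!isMember.test(codePoint)) {
        break;
      }
      end -= Character.charCount(codePoint);
    }
    return new int[] {start, end};
  }
}
```

   For the input `"  ab "` with whitespace as members, the range is [2, 4):
   two leading chars and one trailing char fall outside it, which corresponds
   to the deletions the Javadoc describes.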



##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
     return out.toString();
   }
 
+  /**
+   * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text.
+   *
+   * @param text The text to normalize.
+   * @return The normalized text and its alignment.
+   */
+  public AlignedText normalizeAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      final int charCount = Character.charCount(codePoint);
+      if (members.contains(codePoint)) {
+        out.appendCodePoint(replacement);
+        alignment.replace(charCount, Character.charCount(replacement));
+      } else {
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+      }
+      i += charCount;
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapse(CharSequence)} but also produces the {@link Alignment} back to the
+   * original text. Each collapsed run maps to the run's whole original extent.
+   *
+   * @param text The text to collapse.
+   * @return The collapsed text and its alignment.
+   */
+  public AlignedText collapseAligned(CharSequence text) {
+    Objects.requireNonNull(text, "text");
+    final StringBuilder out = new StringBuilder(text.length());
+    final Alignment.Builder alignment = new Alignment.Builder();
+    final int length = text.length();
+    int i = 0;
+    while (i < length) {
+      final int codePoint = Character.codePointAt(text, i);
+      if (members.contains(codePoint)) {
+        final int runEnd = skipRun(text, i);
+        out.appendCodePoint(replacement);
+        alignment.replace(runEnd - i, Character.charCount(replacement));
+        i = runEnd;
+      } else {
+        final int charCount = Character.charCount(codePoint);
+        out.appendCodePoint(codePoint);
+        alignment.equal(charCount);
+        i += charCount;
+      }
+    }
+    return new AlignedText(text, out.toString(), alignment.build(length));
+  }
+
+  /**
+   * Like {@link #collapsePreserving(CharSequence, CodePointSet, int)} but also produces the
+   * {@link Alignment} back to the original text.
+   *
+   * @param text The text to collapse.
+   * @param keep The member code points whose presence in a run preserves structure.
+   * @param keepReplacement The replacement emitted for a run that contains a {@code keep} member.
+   * @return The collapsed text and its alignment.
+   * @throws IllegalArgumentException Thrown if {@code keepReplacement} is not a valid code point.
+   */
+  public AlignedText collapsePreservingAligned(CharSequence text, CodePointSet keep,
+                                               int keepReplacement) {
+    Objects.requireNonNull(text, "text");

Review Comment:
   Please throw IllegalArgumentException instead of requireNonNull's
NullPointerException.
   Same for the second parameter, `keep`.
   Adjust the Javadoc entry for IllegalArgumentException here; it should be
generic, covering any parameter violation.



##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AlignedNormalizerPipelineTest.java:
##########
@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.util.Span;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertSame;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Exercises {@link OffsetAwareNormalizer} and {@code TextNormalizer.Builder.buildAligned()}: the
+ * cursor-based rungs report alignments, an aligned pipeline composes them with
+ * {@link Alignment#andThen(Alignment)} so a span found in the fully normalized text maps back to the
+ * original input, and a non-alignable rung is rejected loudly.
+ */
+public class AlignedNormalizerPipelineTest {

Review Comment:
   Please also add a new test class that tests the
`LineBreakPreservingWhitespaceCharSequenceNormalizer` class. It is currently
untested in this PR!
   
   Be strict and always provide tests for new classes!



##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java:
##########
@@ -25,7 +25,7 @@
  * regardless of which dash the source used. The mathematical minus signs are left untouched by
  * default, and {@code U+00AD} SOFT HYPHEN (a format character) is not treated as a dash.</p>
  */
-public class DashCharSequenceNormalizer implements CharSequenceNormalizer {
+public class DashCharSequenceNormalizer implements OffsetAwareNormalizer {
 
   private static final long serialVersionUID = 6620885194730155303L;

Review Comment:
   The `serialVersionUID` can't be correct, as the class now implements a
different interface (`OffsetAwareNormalizer`). Hence, it should be regenerated.
   
   Note: All other similar classes that now implement `OffsetAwareNormalizer`
need their UID value recomputed as well. I won't leave a comment for every
single occurrence. Please adjust them accordingly on your own.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
