mawiesne commented on code in PR #1109:
URL: https://github.com/apache/opennlp/pull/1109#discussion_r3510785265
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
return out.toString();
}
+ /**
+ * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text.
+ *
+ * @param text The text to normalize.
+ * @return The normalized text and its alignment.
+ */
+ public AlignedText normalizeAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ final int charCount = Character.charCount(codePoint);
+ if (members.contains(codePoint)) {
+ out.appendCodePoint(replacement);
+ alignment.replace(charCount, Character.charCount(replacement));
+ } else {
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ }
+ i += charCount;
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapse(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text. Each collapsed run maps to the run's whole original extent.
+ *
+ * @param text The text to collapse.
+ * @return The collapsed text and its alignment.
+ */
+ public AlignedText collapseAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
Review Comment:
Please throw an IllegalArgumentException instead of relying on requireNonNull's
NullPointerException.
Also document the IllegalArgumentException properly in the Javadoc here.
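A minimal sketch of the requested pattern, assuming a hypothetical stand-in method (`collapseSpaces` is not part of the PR): an explicit null check that throws IllegalArgumentException, with a matching `@throws` entry in the Javadoc.

```java
/** Hypothetical stand-in for the reviewed method; not the PR's actual code. */
final class NullCheckSketch {

  /**
   * Collapses each run of spaces to a single space.
   *
   * @param text The text to collapse; must not be {@code null}.
   * @return The collapsed text.
   * @throws IllegalArgumentException Thrown if {@code text} is {@code null}.
   */
  static String collapseSpaces(CharSequence text) {
    // Explicit check replaces Objects.requireNonNull, which would throw NullPointerException.
    if (text == null) {
      throw new IllegalArgumentException("text must not be null");
    }
    final StringBuilder out = new StringBuilder(text.length());
    boolean inRun = false;
    for (int i = 0; i < text.length(); i++) {
      final char c = text.charAt(i);
      if (c == ' ') {
        if (!inRun) {
          out.append(' ');
        }
        inRun = true;
      } else {
        out.append(c);
        inRun = false;
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(collapseSpaces("a   b"));
  }
}
```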
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/Alignment.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.Arrays;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A bidirectional alignment between an original text and a normalized form of it.
+ *
+ * <p>Normalization edits text in ways that move character offsets: a run of whitespace collapses to
+ * one space, a supplementary dash folds to a single ASCII hyphen, a case fold can grow text
+ * (German {@code eszett} to {@code ss}), and trimming or stripping deletes characters outright. An
+ * {@code Alignment} records those edits as a sequence of <em>equal</em> runs (text copied through
+ * unchanged in length) and <em>replace</em> runs (a block of original characters that produced a
+ * block of normalized characters), so any span in either form can be mapped to the other.</p>
+ *
+ * <p>Because it represents deletions as gaps and expansions as shared blocks (rather than storing a
+ * single original offset per normalized character, which would assume the normalized text
+ * contiguously covers the original), mapping is done
+ * span to span ({@link #toOriginalSpan(int, int)} / {@link #toNormalizedSpan(int, int)}) so a match
+ * that ends next to deleted text reports a tight span rather than over-covering the deletion. Two
+ * alignments compose with {@link #andThen(Alignment)}, which is what lets a multi-stage
+ * normalization pipeline still map a result all the way back to the original.</p>
+ *
+ * <p>Instances are immutable and thread-safe; build one with {@link Builder}.</p>
Review Comment:
Please add the annotation `@ThreadSafe` to the class here.
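As a rough illustration, assuming a locally declared stand-in annotation (the project's own `@ThreadSafe` and its package are not shown in this thread), an immutable class carrying the marker could look like this. `AlignmentSketch` is hypothetical and only mirrors the shape of the reviewed class.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical, simplified stand-in for the reviewed class. */
@ThreadSafe
final class AlignmentSketch {

  // All state is final and never exposed, so instances can be shared across threads.
  private final int[] originalStart;
  private final int originalLength;

  AlignmentSketch(int[] originalStart, int originalLength) {
    // Defensive copy: later changes to the caller's array cannot leak in.
    this.originalStart = originalStart.clone();
    this.originalLength = originalLength;
  }

  int normalizedLength() {
    return originalStart.length;
  }

  int toOriginalOffset(int normalizedOffset) {
    return normalizedOffset < originalStart.length
        ? originalStart[normalizedOffset] : originalLength;
  }

  public static void main(String[] args) {
    System.out.println(AlignmentSketch.class.isAnnotationPresent(ThreadSafe.class));
  }
}

/** Local stand-in marker annotation (assumption); the real project annotation should be used. */
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ThreadSafe {
}
```

The annotation itself enforces nothing; it documents a guarantee that the final fields and the defensive copy actually provide.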
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/Alignment.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.Arrays;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A bidirectional alignment between an original text and a normalized form of it.
+ *
+ * <p>Normalization edits text in ways that move character offsets: a run of whitespace collapses to
+ * one space, a supplementary dash folds to a single ASCII hyphen, a case fold can grow text
+ * (German {@code eszett} to {@code ss}), and trimming or stripping deletes characters outright. An
+ * {@code Alignment} records those edits as a sequence of <em>equal</em> runs (text copied through
+ * unchanged in length) and <em>replace</em> runs (a block of original characters that produced a
+ * block of normalized characters), so any span in either form can be mapped to the other.</p>
+ *
+ * <p>Because it represents deletions as gaps and expansions as shared blocks (rather than storing a
+ * single original offset per normalized character, which would assume the normalized text
+ * contiguously covers the original), mapping is done
+ * span to span ({@link #toOriginalSpan(int, int)} / {@link #toNormalizedSpan(int, int)}) so a match
+ * that ends next to deleted text reports a tight span rather than over-covering the deletion. Two
+ * alignments compose with {@link #andThen(Alignment)}, which is what lets a multi-stage
+ * normalization pipeline still map a result all the way back to the original.</p>
+ *
+ * <p>Instances are immutable and thread-safe; build one with {@link Builder}.</p>
+ */
+public final class Alignment {
+
+ // For normalized character k, originalStart[k]/originalEnd[k] are the
half-open original range it
+ // was produced from. Characters copied unchanged map one to one; characters
from a collapse or
+ // expansion share their run's whole original range (it cannot be
subdivided); deleted original
+ // characters appear as a gap that no normalized character covers.
+ private final int[] originalStart;
+ private final int[] originalEnd;
+ private final int originalLength;
+
+ private Alignment(int[] originalStart, int[] originalEnd, int
originalLength) {
+ this.originalStart = originalStart;
+ this.originalEnd = originalEnd;
+ this.originalLength = originalLength;
+ }
+
+ /** {@return the length of the normalized text this alignment was built for}
*/
+ public int normalizedLength() {
+ return originalStart.length;
+ }
+
+ /** {@return the length of the original text this alignment was built for} */
+ public int originalLength() {
+ return originalLength;
+ }
+
+ /**
+ * Maps a half-open span of the normalized text to the tightest half-open
span of the original
+ * text that produced it.
+ *
+ * @param normalizedStart The inclusive start offset, in {@code [0,
normalizedLength()]}.
+ * @param normalizedEnd The exclusive end offset, in {@code
[normalizedStart, normalizedLength()]}.
+ * @return The corresponding original span.
+ * @throws IndexOutOfBoundsException Thrown if the offsets are out of range
or inverted.
+ */
+ public Span toOriginalSpan(int normalizedStart, int normalizedEnd) {
+ checkRange(normalizedStart, normalizedEnd, normalizedLength());
+ if (normalizedStart == normalizedEnd) {
+ final int at = normalizedStart < normalizedLength()
+ ? originalStart[normalizedStart] : originalLength;
+ return new Span(at, at);
+ }
+ return new Span(originalStart[normalizedStart], originalEnd[normalizedEnd
- 1]);
+ }
+
+ /**
+ * Maps a half-open span of the original text to the half-open span of the
normalized text that
+ * covers it. Original characters that were deleted map to an empty span at
the point where they
+ * were removed.
+ *
+ * @param originalStartOffset The inclusive start offset, in {@code [0,
originalLength()]}.
+ * @param originalEndOffset The exclusive end offset, in {@code
[originalStartOffset, originalLength()]}.
+ * @return The corresponding normalized span.
+ * @throws IndexOutOfBoundsException Thrown if the offsets are out of range
or inverted.
+ */
+ public Span toNormalizedSpan(int originalStartOffset, int originalEndOffset)
{
+ checkRange(originalStartOffset, originalEndOffset, originalLength);
+ final int start = firstIndexEndingAfter(originalStartOffset);
+ final int end = firstIndexStartingAtOrAfter(originalEndOffset);
+ return new Span(start, Math.max(start, end));
+ }
+
+ /**
+ * Maps a normalized offset to the original offset where its character
begins (start semantics).
+ * Prefer {@link #toOriginalSpan(int, int)} for mapping a match, since a
single offset cannot
+ * distinguish the start and end of a span across a deletion.
+ *
+ * @param normalizedOffset An offset in {@code [0, normalizedLength()]}.
+ * @return The corresponding original offset.
+ * @throws IndexOutOfBoundsException Thrown if {@code normalizedOffset} is
out of range.
+ */
+ public int toOriginalOffset(int normalizedOffset) {
+ if (normalizedOffset < 0 || normalizedOffset > normalizedLength()) {
+ throw new IndexOutOfBoundsException("normalized offset " +
normalizedOffset
+ + " is outside [0, " + normalizedLength() + "]");
+ }
+ return normalizedOffset < normalizedLength() ?
originalStart[normalizedOffset] : originalLength;
+ }
+
+ /**
+ * Composes this alignment with one that further normalizes this alignment's
normalized text.
+ *
+ * <p>If this maps {@code original -> middle} and {@code next} maps {@code
middle -> final}, the
+ * result maps {@code original -> final} directly, so a span found in the
final text can be mapped
+ * straight back to the original without keeping the intermediate stages.</p>
+ *
+ * @param next The next stage, whose original side is this stage's
normalized text.
+ * @return The composed alignment.
+ * @throws IllegalArgumentException Thrown if {@code next.originalLength()}
does not equal this
+ * {@code normalizedLength()} (the stages do not line up).
+ */
+ public Alignment andThen(Alignment next) {
+ if (next.originalLength != normalizedLength()) {
+ throw new IllegalArgumentException("stages do not line up: this
normalizedLength="
+ + normalizedLength() + " but next originalLength=" +
next.originalLength);
+ }
+ final int finalLength = next.normalizedLength();
+ final int[] starts = new int[finalLength];
+ final int[] ends = new int[finalLength];
+ for (int f = 0; f < finalLength; f++) {
+ final int middleStart = next.originalStart[f];
+ final int middleEnd = next.originalEnd[f];
+ final int start = middleStart < normalizedLength() ?
originalStart[middleStart] : originalLength;
+ final int end = middleEnd > 0 ? originalEnd[middleEnd - 1] : 0;
+ starts[f] = start;
+ // Math.max keeps the original span non-inverted. When next inserted
this final character
+ // (a zero-width middle range, middleStart == middleEnd) the max
collapses it to a zero-width
+ // original span -- correct for every insertion except one landing
strictly inside an
+ // expansion this stage produced, where the characters on either side
share one atomic
+ // original block (originalEnd[middleEnd - 1] >
originalStart[middleStart]) that has no
+ // interior offset to point at. There the insertion is attributed to
that whole block, the
+ // only choice that keeps originalStart/originalEnd sorted so
toOriginalSpan/toNormalizedSpan
+ // keep their O(log n) search; forcing it to zero-width would push
originalEnd below its
+ // predecessor and corrupt the reverse mapping.
+ ends[f] = Math.max(start, end);
+ }
+ return new Alignment(starts, ends, originalLength);
+ }
+
+ // First normalized index whose original coverage ends strictly after offset
(so it covers or
+ // follows offset); normalizedLength() when offset is at or past the last
covered original char.
+ private int firstIndexEndingAfter(int offset) {
+ int low = 0;
+ int high = originalEnd.length;
+ while (low < high) {
+ final int mid = (low + high) >>> 1;
+ if (originalEnd[mid] > offset) {
+ high = mid;
+ } else {
+ low = mid + 1;
+ }
+ }
+ return low;
+ }
+
+ // First normalized index whose original coverage starts at or after offset.
+ private int firstIndexStartingAtOrAfter(int offset) {
+ int low = 0;
+ int high = originalStart.length;
+ while (low < high) {
+ final int mid = (low + high) >>> 1;
+ if (originalStart[mid] >= offset) {
+ high = mid;
+ } else {
+ low = mid + 1;
+ }
+ }
+ return low;
+ }
+
+ private static void checkRange(int start, int end, int length) {
+ if (start < 0 || end > length || start > end) {
+ throw new IndexOutOfBoundsException("span [" + start + ", " + end + ")
is outside [0, "
+ + length + "]");
+ }
+ }
+
+ /**
+ * Builds an {@link Alignment} as the normalized text is produced, by
recording each edit in order.
+ * Call {@link #equal(int)} for characters copied through unchanged and
{@link #replace(int, int)}
+ * for a block that was rewritten (including deletions and insertions), then
{@link #build(int)}.
+ */
+ public static final class Builder {
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ private int[] starts = new int[16];
+ private int[] ends = new int[16];
+ private int count;
+ private int originalCursor;
+
+ /**
+ * Records {@code charCount} characters copied through unchanged (a one to
one run).
+ *
+ * @param charCount The number of UTF-16 characters; must not be negative.
+ * @return This builder.
+ */
+ public Builder equal(int charCount) {
+ if (charCount < 0) {
+ throw new IllegalArgumentException("charCount must not be negative: " + charCount);
Review Comment:
Please add the IllegalArgumentException to the Javadoc here.
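A sketch of the documented method, assuming a simplified hypothetical builder (`BuilderSketch` stores ranges in a list rather than the PR's growable arrays); the only point is the added `@throws` line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for Alignment.Builder. */
final class BuilderSketch {

  private final List<int[]> ranges = new ArrayList<>();
  private int originalCursor;

  /**
   * Records {@code charCount} characters copied through unchanged (a one to one run).
   *
   * @param charCount The number of UTF-16 characters; must not be negative.
   * @return This builder.
   * @throws IllegalArgumentException Thrown if {@code charCount} is negative.
   */
  BuilderSketch equal(int charCount) {
    if (charCount < 0) {
      throw new IllegalArgumentException("charCount must not be negative: " + charCount);
    }
    // Each copied character maps to exactly one original character.
    for (int i = 0; i < charCount; i++) {
      ranges.add(new int[] {originalCursor, originalCursor + 1});
      originalCursor++;
    }
    return this;
  }

  int size() {
    return ranges.size();
  }

  public static void main(String[] args) {
    System.out.println(new BuilderSketch().equal(3).size());
  }
}
```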
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
return out.toString();
}
+ /**
+ * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text.
+ *
+ * @param text The text to normalize.
+ * @return The normalized text and its alignment.
+ */
+ public AlignedText normalizeAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
Review Comment:
Please throw an IllegalArgumentException instead of relying on requireNonNull's
NullPointerException.
Also document the IllegalArgumentException properly in the Javadoc here.
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/Alignment.java:
##########
@@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.Arrays;
+
+import opennlp.tools.util.Span;
+
+/**
+ * A bidirectional alignment between an original text and a normalized form of it.
+ *
+ * <p>Normalization edits text in ways that move character offsets: a run of whitespace collapses to
+ * one space, a supplementary dash folds to a single ASCII hyphen, a case fold can grow text
+ * (German {@code eszett} to {@code ss}), and trimming or stripping deletes characters outright. An
+ * {@code Alignment} records those edits as a sequence of <em>equal</em> runs (text copied through
+ * unchanged in length) and <em>replace</em> runs (a block of original characters that produced a
+ * block of normalized characters), so any span in either form can be mapped to the other.</p>
+ *
+ * <p>Because it represents deletions as gaps and expansions as shared blocks (rather than storing a
+ * single original offset per normalized character, which would assume the normalized text
+ * contiguously covers the original), mapping is done
+ * span to span ({@link #toOriginalSpan(int, int)} / {@link #toNormalizedSpan(int, int)}) so a match
+ * that ends next to deleted text reports a tight span rather than over-covering the deletion. Two
+ * alignments compose with {@link #andThen(Alignment)}, which is what lets a multi-stage
+ * normalization pipeline still map a result all the way back to the original.</p>
+ *
+ * <p>Instances are immutable and thread-safe; build one with {@link Builder}.</p>
+ */
+public final class Alignment {
+
+ // For normalized character k, originalStart[k]/originalEnd[k] are the
half-open original range it
+ // was produced from. Characters copied unchanged map one to one; characters
from a collapse or
+ // expansion share their run's whole original range (it cannot be
subdivided); deleted original
+ // characters appear as a gap that no normalized character covers.
+ private final int[] originalStart;
+ private final int[] originalEnd;
+ private final int originalLength;
+
+ private Alignment(int[] originalStart, int[] originalEnd, int
originalLength) {
+ this.originalStart = originalStart;
+ this.originalEnd = originalEnd;
+ this.originalLength = originalLength;
+ }
+
+ /** {@return the length of the normalized text this alignment was built for}
*/
+ public int normalizedLength() {
+ return originalStart.length;
+ }
+
+ /** {@return the length of the original text this alignment was built for} */
+ public int originalLength() {
+ return originalLength;
+ }
+
+ /**
+ * Maps a half-open span of the normalized text to the tightest half-open
span of the original
+ * text that produced it.
+ *
+ * @param normalizedStart The inclusive start offset, in {@code [0,
normalizedLength()]}.
+ * @param normalizedEnd The exclusive end offset, in {@code
[normalizedStart, normalizedLength()]}.
+ * @return The corresponding original span.
+ * @throws IndexOutOfBoundsException Thrown if the offsets are out of range
or inverted.
+ */
+ public Span toOriginalSpan(int normalizedStart, int normalizedEnd) {
+ checkRange(normalizedStart, normalizedEnd, normalizedLength());
+ if (normalizedStart == normalizedEnd) {
+ final int at = normalizedStart < normalizedLength()
+ ? originalStart[normalizedStart] : originalLength;
+ return new Span(at, at);
+ }
+ return new Span(originalStart[normalizedStart], originalEnd[normalizedEnd
- 1]);
+ }
+
+ /**
+ * Maps a half-open span of the original text to the half-open span of the
normalized text that
+ * covers it. Original characters that were deleted map to an empty span at
the point where they
+ * were removed.
+ *
+ * @param originalStartOffset The inclusive start offset, in {@code [0,
originalLength()]}.
+ * @param originalEndOffset The exclusive end offset, in {@code
[originalStartOffset, originalLength()]}.
+ * @return The corresponding normalized span.
+ * @throws IndexOutOfBoundsException Thrown if the offsets are out of range
or inverted.
+ */
+ public Span toNormalizedSpan(int originalStartOffset, int originalEndOffset)
{
+ checkRange(originalStartOffset, originalEndOffset, originalLength);
+ final int start = firstIndexEndingAfter(originalStartOffset);
+ final int end = firstIndexStartingAtOrAfter(originalEndOffset);
+ return new Span(start, Math.max(start, end));
+ }
+
+ /**
+ * Maps a normalized offset to the original offset where its character
begins (start semantics).
+ * Prefer {@link #toOriginalSpan(int, int)} for mapping a match, since a
single offset cannot
+ * distinguish the start and end of a span across a deletion.
+ *
+ * @param normalizedOffset An offset in {@code [0, normalizedLength()]}.
+ * @return The corresponding original offset.
+ * @throws IndexOutOfBoundsException Thrown if {@code normalizedOffset} is
out of range.
+ */
+ public int toOriginalOffset(int normalizedOffset) {
+ if (normalizedOffset < 0 || normalizedOffset > normalizedLength()) {
+ throw new IndexOutOfBoundsException("normalized offset " +
normalizedOffset
+ + " is outside [0, " + normalizedLength() + "]");
+ }
+ return normalizedOffset < normalizedLength() ?
originalStart[normalizedOffset] : originalLength;
+ }
+
+ /**
+ * Composes this alignment with one that further normalizes this alignment's
normalized text.
+ *
+ * <p>If this maps {@code original -> middle} and {@code next} maps {@code
middle -> final}, the
+ * result maps {@code original -> final} directly, so a span found in the
final text can be mapped
+ * straight back to the original without keeping the intermediate stages.</p>
+ *
+ * @param next The next stage, whose original side is this stage's
normalized text.
+ * @return The composed alignment.
+ * @throws IllegalArgumentException Thrown if {@code next.originalLength()}
does not equal this
+ * {@code normalizedLength()} (the stages do not line up).
+ */
+ public Alignment andThen(Alignment next) {
+ if (next.originalLength != normalizedLength()) {
+ throw new IllegalArgumentException("stages do not line up: this
normalizedLength="
+ + normalizedLength() + " but next originalLength=" +
next.originalLength);
+ }
+ final int finalLength = next.normalizedLength();
+ final int[] starts = new int[finalLength];
+ final int[] ends = new int[finalLength];
+ for (int f = 0; f < finalLength; f++) {
+ final int middleStart = next.originalStart[f];
+ final int middleEnd = next.originalEnd[f];
+ final int start = middleStart < normalizedLength() ?
originalStart[middleStart] : originalLength;
+ final int end = middleEnd > 0 ? originalEnd[middleEnd - 1] : 0;
+ starts[f] = start;
+ // Math.max keeps the original span non-inverted. When next inserted
this final character
+ // (a zero-width middle range, middleStart == middleEnd) the max
collapses it to a zero-width
+ // original span -- correct for every insertion except one landing
strictly inside an
+ // expansion this stage produced, where the characters on either side
share one atomic
+ // original block (originalEnd[middleEnd - 1] >
originalStart[middleStart]) that has no
+ // interior offset to point at. There the insertion is attributed to
that whole block, the
+ // only choice that keeps originalStart/originalEnd sorted so
toOriginalSpan/toNormalizedSpan
+ // keep their O(log n) search; forcing it to zero-width would push
originalEnd below its
+ // predecessor and corrupt the reverse mapping.
+ ends[f] = Math.max(start, end);
+ }
+ return new Alignment(starts, ends, originalLength);
+ }
+
+ // First normalized index whose original coverage ends strictly after offset
(so it covers or
+ // follows offset); normalizedLength() when offset is at or past the last
covered original char.
+ private int firstIndexEndingAfter(int offset) {
+ int low = 0;
+ int high = originalEnd.length;
+ while (low < high) {
+ final int mid = (low + high) >>> 1;
+ if (originalEnd[mid] > offset) {
+ high = mid;
+ } else {
+ low = mid + 1;
+ }
+ }
+ return low;
+ }
+
+ // First normalized index whose original coverage starts at or after offset.
+ private int firstIndexStartingAtOrAfter(int offset) {
+ int low = 0;
+ int high = originalStart.length;
+ while (low < high) {
+ final int mid = (low + high) >>> 1;
+ if (originalStart[mid] >= offset) {
+ high = mid;
+ } else {
+ low = mid + 1;
+ }
+ }
+ return low;
+ }
+
+ private static void checkRange(int start, int end, int length) {
+ if (start < 0 || end > length || start > end) {
+ throw new IndexOutOfBoundsException("span [" + start + ", " + end + ")
is outside [0, "
+ + length + "]");
+ }
+ }
+
+ /**
+ * Builds an {@link Alignment} as the normalized text is produced, by
recording each edit in order.
+ * Call {@link #equal(int)} for characters copied through unchanged and
{@link #replace(int, int)}
+ * for a block that was rewritten (including deletions and insertions), then
{@link #build(int)}.
+ */
+ public static final class Builder {
+
+ private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+ private int[] starts = new int[16];
+ private int[] ends = new int[16];
+ private int count;
+ private int originalCursor;
+
+ /**
+ * Records {@code charCount} characters copied through unchanged (a one to
one run).
+ *
+ * @param charCount The number of UTF-16 characters; must not be negative.
+ * @return This builder.
+ */
+ public Builder equal(int charCount) {
+ if (charCount < 0) {
+ throw new IllegalArgumentException("charCount must not be negative: "
+ charCount);
+ }
+ for (int i = 0; i < charCount; i++) {
+ append(originalCursor, originalCursor + 1);
+ originalCursor++;
+ }
+ return this;
+ }
+
+ /**
+ * Records a rewritten block: {@code originalCount} original characters
that produced
+ * {@code normalizedCount} normalized characters. Each produced character
is attributed to the
+ * whole original block, since a collapse or expansion cannot be
subdivided. {@code 0} for
+ * {@code normalizedCount} is a deletion; {@code 0} for {@code
originalCount} is an insertion.
+ *
+ * @param originalCount The number of original characters consumed; must
not be negative.
+ * @param normalizedCount The number of normalized characters produced;
must not be negative.
+ * @return This builder.
+ */
+ public Builder replace(int originalCount, int normalizedCount) {
+ if (originalCount < 0 || normalizedCount < 0) {
+ throw new IllegalArgumentException("counts must not be negative: " + originalCount
Review Comment:
Please add the IllegalArgumentException to the Javadoc here.
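A sketch under stated assumptions: the body of `replace` is cut off in the quoted hunk, so the block attribution below is inferred from the method's Javadoc ("each produced character is attributed to the whole original block") and is not the PR's actual code. `ReplaceSketch` and `rangeOf` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for Alignment.Builder#replace. */
final class ReplaceSketch {

  private final List<int[]> ranges = new ArrayList<>();
  private int originalCursor;

  /**
   * Records a rewritten block of {@code originalCount} original characters that produced
   * {@code normalizedCount} normalized characters.
   *
   * @param originalCount The number of original characters consumed; must not be negative.
   * @param normalizedCount The number of normalized characters produced; must not be negative.
   * @return This builder.
   * @throws IllegalArgumentException Thrown if either count is negative.
   */
  ReplaceSketch replace(int originalCount, int normalizedCount) {
    if (originalCount < 0 || normalizedCount < 0) {
      throw new IllegalArgumentException("counts must not be negative: " + originalCount
          + ", " + normalizedCount);
    }
    final int blockEnd = originalCursor + originalCount;
    // Assumed from the Javadoc: every produced character shares the whole original block.
    for (int i = 0; i < normalizedCount; i++) {
      ranges.add(new int[] {originalCursor, blockEnd});
    }
    // A deletion (normalizedCount == 0) only advances the cursor, leaving a gap.
    originalCursor = blockEnd;
    return this;
  }

  /** Returns a copy of the half-open original range for normalized character {@code k}. */
  int[] rangeOf(int k) {
    return ranges.get(k).clone();
  }

  int size() {
    return ranges.size();
  }

  public static void main(String[] args) {
    // Three whitespace characters collapsed to one space, then two characters deleted.
    final ReplaceSketch sketch = new ReplaceSketch().replace(3, 1).replace(2, 0);
    System.out.println(sketch.size());
  }
}
```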
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
return out.toString();
}
+ /**
+ * Like {@link #normalize(CharSequence)} but also produces the {@link
Alignment} back to the
+ * original text.
+ *
+ * @param text The text to normalize.
+ * @return The normalized text and its alignment.
+ */
+ public AlignedText normalizeAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ final int charCount = Character.charCount(codePoint);
+ if (members.contains(codePoint)) {
+ out.appendCodePoint(replacement);
+ alignment.replace(charCount, Character.charCount(replacement));
+ } else {
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ }
+ i += charCount;
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapse(CharSequence)} but also produces the {@link
Alignment} back to the
+ * original text. Each collapsed run maps to the run's whole original extent.
+ *
+ * @param text The text to collapse.
+ * @return The collapsed text and its alignment.
+ */
+ public AlignedText collapseAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ if (members.contains(codePoint)) {
+ final int runEnd = skipRun(text, i);
+ out.appendCodePoint(replacement);
+ alignment.replace(runEnd - i, Character.charCount(replacement));
+ i = runEnd;
+ } else {
+ final int charCount = Character.charCount(codePoint);
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ i += charCount;
+ }
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapsePreserving(CharSequence, CodePointSet, int)} but
also produces the
+ * {@link Alignment} back to the original text.
+ *
+ * @param text The text to collapse.
+ * @param keep The member code points whose presence in a run preserves
structure.
+ * @param keepReplacement The replacement emitted for a run that contains a
{@code keep} member.
+ * @return The collapsed text and its alignment.
+ * @throws IllegalArgumentException Thrown if {@code keepReplacement} is not
a valid code point.
+ */
+ public AlignedText collapsePreservingAligned(CharSequence text, CodePointSet
keep,
+ int keepReplacement) {
+ Objects.requireNonNull(text, "text");
+ Objects.requireNonNull(keep, "keep");
+ requireValidCodePoint(keepReplacement);
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ if (members.contains(codePoint)) {
+ boolean preserve = keep.contains(codePoint);
+ int j = i + Character.charCount(codePoint);
+ while (j < length) {
+ final int next = Character.codePointAt(text, j);
+ if (!members.contains(next)) {
+ break;
+ }
+ preserve |= keep.contains(next);
+ j += Character.charCount(next);
+ }
+ final int emitted = preserve ? keepReplacement : replacement;
+ out.appendCodePoint(emitted);
+ alignment.replace(j - i, Character.charCount(emitted));
+ i = j;
+ } else {
+ final int charCount = Character.charCount(codePoint);
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ i += charCount;
+ }
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #trim(CharSequence)} but also produces the {@link Alignment} back to the original
+ * text. The trimmed leading and trailing members appear as deletions, so a span never reports
+ * through them.
+ *
+ * @param text The text to trim.
+ * @return The trimmed text and its alignment.
+ */
+ public AlignedText trimAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
Review Comment:
Please throw an IllegalArgumentException instead of relying on requireNonNull's
NullPointerException.
Adjust the Javadoc here to document the IllegalArgumentException.
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -329,6 +498,37 @@ public static String substitute(CharSequence text,
IntFunction<String> substitut
return out.toString();
}
+ /**
+ * Like {@link #substitute(CharSequence, IntFunction)} but also produces the {@link Alignment} back
+ * to the original text. Each replaced code point maps to its replacement string as one block.
+ *
+ * @param text The text to transform.
+ * @param substitution The replacement for a code point, or {@code null} to copy it through.
+ * @return The transformed text and its alignment.
+ */
+ public static AlignedText substituteAligned(CharSequence text, IntFunction<String> substitution) {
+ Objects.requireNonNull(text, "text");
Review Comment:
Please use IllegalArgumentException instead of requireNonNull's
NullPointerException.
The same applies to the second parameter, `substitution`.
Add documentation for IllegalArgumentException to the Javadoc here. It should be
generic, covering any parameter violation.
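A minimal sketch of what this request could look like, assuming a hypothetical helper named `requireArgument` (it is not part of the PR or of OpenNLP) and a stand-in method `checkedLength` that only mirrors the two parameter checks of `substituteAligned`:

```java
import java.util.function.IntFunction;

/** Hypothetical helper, not part of the PR: turns a null argument into IllegalArgumentException. */
final class ArgChecks {

  private ArgChecks() {
  }

  /**
   * @param value The argument to check.
   * @param name The parameter name used in the exception message.
   * @return {@code value}, if it is not {@code null}.
   * @throws IllegalArgumentException Thrown if parameters are invalid.
   */
  static <T> T requireArgument(T value, String name) {
    if (value == null) {
      throw new IllegalArgumentException(name + " must not be null");
    }
    return value;
  }

  /** Stand-in for substituteAligned (assumption): only the parameter checks are shown. */
  static int checkedLength(CharSequence text, IntFunction<String> substitution) {
    requireArgument(text, "text");
    requireArgument(substitution, "substitution");
    return text.length();
  }

  public static void main(String[] args) {
    try {
      checkedLength(null, codePoint -> null);
    } catch (IllegalArgumentException e) {
      // prints: text must not be null
      System.out.println(e.getMessage());
    }
  }
}
```

The generic `@throws` wording ("Thrown if parameters are invalid") is only one possible phrasing; the point is that a single Javadoc line covers every parameter check of the method.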
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
return out.toString();
}
+ /**
+ * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text.
+ *
+ * @param text The text to normalize.
+ * @return The normalized text and its alignment.
+ */
+ public AlignedText normalizeAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ final int charCount = Character.charCount(codePoint);
+ if (members.contains(codePoint)) {
+ out.appendCodePoint(replacement);
+ alignment.replace(charCount, Character.charCount(replacement));
+ } else {
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ }
+ i += charCount;
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapse(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text. Each collapsed run maps to the run's whole original extent.
+ *
+ * @param text The text to collapse.
+ * @return The collapsed text and its alignment.
+ */
+ public AlignedText collapseAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ if (members.contains(codePoint)) {
+ final int runEnd = skipRun(text, i);
+ out.appendCodePoint(replacement);
+ alignment.replace(runEnd - i, Character.charCount(replacement));
+ i = runEnd;
+ } else {
+ final int charCount = Character.charCount(codePoint);
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ i += charCount;
+ }
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapsePreserving(CharSequence, CodePointSet, int)} but also produces the
+ * {@link Alignment} back to the original text.
+ *
+ * @param text The text to collapse.
+ * @param keep The member code points whose presence in a run preserves structure.
+ * @param keepReplacement The replacement emitted for a run that contains a {@code keep} member.
+ * @return The collapsed text and its alignment.
+ * @throws IllegalArgumentException Thrown if {@code keepReplacement} is not a valid code point.
+ */
+ public AlignedText collapsePreservingAligned(CharSequence text, CodePointSet keep,
+ int keepReplacement) {
+ Objects.requireNonNull(text, "text");
+ Objects.requireNonNull(keep, "keep");
+ requireValidCodePoint(keepReplacement);
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ if (members.contains(codePoint)) {
+ boolean preserve = keep.contains(codePoint);
+ int j = i + Character.charCount(codePoint);
+ while (j < length) {
+ final int next = Character.codePointAt(text, j);
+ if (!members.contains(next)) {
+ break;
+ }
+ preserve |= keep.contains(next);
+ j += Character.charCount(next);
+ }
+ final int emitted = preserve ? keepReplacement : replacement;
+ out.appendCodePoint(emitted);
+ alignment.replace(j - i, Character.charCount(emitted));
+ i = j;
+ } else {
+ final int charCount = Character.charCount(codePoint);
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ i += charCount;
+ }
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #trim(CharSequence)} but also produces the {@link Alignment} back to the original
+ * text. The trimmed leading and trailing members appear as deletions, so a span never reports
+ * through them.
+ *
+ * @param text The text to trim.
+ * @return The trimmed text and its alignment.
+ */
+ public AlignedText trimAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final int length = text.length();
+ int start = 0;
+ while (start < length) {
+ final int codePoint = Character.codePointAt(text, start);
+ if (!members.contains(codePoint)) {
+ break;
+ }
+ start += Character.charCount(codePoint);
+ }
+ int end = length;
+ while (end > start) {
+ final int codePoint = Character.codePointBefore(text, end);
+ if (!members.contains(codePoint)) {
+ break;
+ }
+ end -= Character.charCount(codePoint);
+ }
+ final Alignment.Builder alignment = new Alignment.Builder();
+ if (start > 0) {
+ alignment.replace(start, 0);
+ }
+ alignment.equal(end - start);
+ if (end < length) {
+ alignment.replace(length - end, 0);
+ }
+ return new AlignedText(text, text.subSequence(start, end).toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #removeAll(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text. Every removed member appears as a deletion, so a span never reports through one.
+ *
+ * @param text The text to filter.
+ * @return The filtered text and its alignment.
+ */
+ public AlignedText removeAllAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
Review Comment:
Please use IllegalArgumentException instead of requireNonNull's
NullPointerException.
Adjust the Javadoc here to document the IllegalArgumentException.
##########
opennlp-api/src/main/java/opennlp/tools/util/normalizer/CharClass.java:
##########
@@ -300,6 +300,175 @@ public String removeAll(CharSequence text) {
return out.toString();
}
+ /**
+ * Like {@link #normalize(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text.
+ *
+ * @param text The text to normalize.
+ * @return The normalized text and its alignment.
+ */
+ public AlignedText normalizeAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ final int charCount = Character.charCount(codePoint);
+ if (members.contains(codePoint)) {
+ out.appendCodePoint(replacement);
+ alignment.replace(charCount, Character.charCount(replacement));
+ } else {
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ }
+ i += charCount;
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapse(CharSequence)} but also produces the {@link Alignment} back to the
+ * original text. Each collapsed run maps to the run's whole original extent.
+ *
+ * @param text The text to collapse.
+ * @return The collapsed text and its alignment.
+ */
+ public AlignedText collapseAligned(CharSequence text) {
+ Objects.requireNonNull(text, "text");
+ final StringBuilder out = new StringBuilder(text.length());
+ final Alignment.Builder alignment = new Alignment.Builder();
+ final int length = text.length();
+ int i = 0;
+ while (i < length) {
+ final int codePoint = Character.codePointAt(text, i);
+ if (members.contains(codePoint)) {
+ final int runEnd = skipRun(text, i);
+ out.appendCodePoint(replacement);
+ alignment.replace(runEnd - i, Character.charCount(replacement));
+ i = runEnd;
+ } else {
+ final int charCount = Character.charCount(codePoint);
+ out.appendCodePoint(codePoint);
+ alignment.equal(charCount);
+ i += charCount;
+ }
+ }
+ return new AlignedText(text, out.toString(), alignment.build(length));
+ }
+
+ /**
+ * Like {@link #collapsePreserving(CharSequence, CodePointSet, int)} but also produces the
+ * {@link Alignment} back to the original text.
+ *
+ * @param text The text to collapse.
+ * @param keep The member code points whose presence in a run preserves structure.
+ * @param keepReplacement The replacement emitted for a run that contains a {@code keep} member.
+ * @return The collapsed text and its alignment.
+ * @throws IllegalArgumentException Thrown if {@code keepReplacement} is not a valid code point.
+ */
+ public AlignedText collapsePreservingAligned(CharSequence text, CodePointSet keep,
+ int keepReplacement) {
+ Objects.requireNonNull(text, "text");
Review Comment:
Please use IllegalArgumentException instead of requireNonNull's
NullPointerException.
The same applies to the second parameter, `keep`.
Adjust the Javadoc for IllegalArgumentException here, as it should be
generic, covering any parameter violation.
##########
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AlignedNormalizerPipelineTest.java:
##########
@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import org.junit.jupiter.api.Test;
+
+import opennlp.tools.util.Span;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertSame;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Exercises {@link OffsetAwareNormalizer} and {@code TextNormalizer.Builder.buildAligned()}: the
+ * cursor-based rungs report alignments, an aligned pipeline composes them with
+ * {@link Alignment#andThen(Alignment)} so a span found in the fully normalized text maps back to the
+ * original input, and a non-alignable rung is rejected loudly.
+ */
+public class AlignedNormalizerPipelineTest {
Review Comment:
Please also add a new test class that tests the
`LineBreakPreservingWhitespaceCharSequenceNormalizer` class. It is currently
untested in this PR!
Be strict and always provide tests for new classes.
##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java:
##########
@@ -25,7 +25,7 @@
* regardless of which dash the source used. The mathematical minus signs are left untouched by
* default, and {@code U+00AD} SOFT HYPHEN (a format character) is not treated as a dash.</p>
*/
-public class DashCharSequenceNormalizer implements CharSequenceNormalizer {
+public class DashCharSequenceNormalizer implements OffsetAwareNormalizer {
private static final long serialVersionUID = 6620885194730155303L;
Review Comment:
The `serialVersionUID` can't be correct as the class now implements a
different interface (`OffsetAwareNormalizer`). Hence, it should be regenerated.
Note: All other similar classes that now implement `OffsetAwareNormalizer`
need their UID value recomputed as well. I won't leave a comment for every
single occurrence. Please adjust them accordingly on your own.
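For reference, a small sketch of one way to read a class's stream UID with the standard `java.io.ObjectStreamClass` API. The nested classes `Undeclared` and `Declared` are hypothetical stand-ins, not classes from this PR. The default UID is computed from the class signature, which includes the names of the implemented interfaces; a declared `serialVersionUID` field takes precedence, so the computed default only becomes visible while the field is temporarily removed.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

/** Illustrative probe (assumption, not part of the PR) for reading serialVersionUID values. */
final class SerialUidProbe {

  /** Hypothetical class without a declared UID: the runtime computes a default one. */
  static final class Undeclared implements Serializable {
    int value;
  }

  /** Hypothetical class with a declared UID: the declared value is reported as-is. */
  static final class Declared implements Serializable {
    private static final long serialVersionUID = 42L;
    int value;
  }

  /** Returns the UID the serialization runtime associates with a Serializable class. */
  static long uidOf(Class<?> type) {
    return ObjectStreamClass.lookup(type).getSerialVersionUID();
  }

  public static void main(String[] args) {
    // Default UID, derived from the class signature (name, interfaces, fields, ...).
    System.out.println(uidOf(Undeclared.class));
    // Declared UID; prints 42.
    System.out.println(uidOf(Declared.class));
  }
}
```

The JDK's `serialver` command-line tool reports the same value for a compiled class on the classpath.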