Re: [PR] OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) (opennlp)

via GitHub Sun, 21 Jun 2026 20:03:21 -0700


Copilot commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3449575431



##########
opennlp-docs/src/docbkx/normalizer.xml:
##########
@@ -0,0 +1,586 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
+       license agreements. See the NOTICE file distributed with this work for additional
+       information regarding copyright ownership. The ASF licenses this file to
+       you under the Apache License, Version 2.0 (the "License"); you may not use
+       this file except in compliance with the License. You may obtain a copy of
+       the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
+       by applicable law or agreed to in writing, software distributed under the
+       License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
+       OF ANY KIND, either express or implied. See the License for the specific
+       language governing permissions and limitations under the License. -->
+
+<chapter xml:id="tools.normalizer">
+
+       <title>Text Normalization</title>
+
+       <section xml:id="tools.normalizer.introduction">
+               <title>Introduction</title>
+               <para>
+                       The package <code>opennlp.tools.util.normalizer</code> provides Unicode-aware text
+                       normalization for matching, search, and tokenization preprocessing. It cleans up the
+                       kinds of inconsistency that real text carries when it is copied from the web, PDFs,
+                       office documents, or multilingual sources: spacing that is not an ordinary space, the
+                       many dash and quotation variants, decomposed versus precomposed accents, non-ASCII
+                       digits, and invisible control characters.
+               </para>
+               <para>
+                       The implementation follows three principles:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
+                                       Unicode Character Database (for example the <code>White_Space</code> and
+                                       <code>Dash</code> properties), not from the JVM's locale-dependent or quirky
+                                       character predicates. The library never relies on
+                                       <code>Character.isWhitespace</code>, which disagrees with the Unicode standard.

Review Comment:
   The phrasing "The library never relies on Character.isWhitespace" is 
ambiguous and can be read as applying to OpenNLP as a whole (which does use 
Character.isWhitespace elsewhere, e.g. StringUtil). Consider tightening this to 
explicitly refer to the normalization package/engine to avoid a misleading 
claim.
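
   To make the disagreement concrete, here is a minimal, self-contained Java
   sketch. It does not use any OpenNLP class; the class name WhiteSpaceCheck
   and its helper methods are made up for illustration. It compares
   Character.isWhitespace, the default regex \s, and the JDK regex property
   \p{IsWhite_Space} on a few code points.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of OpenNLP.
final class WhiteSpaceCheck {

    // JDK regex binary property for the Unicode White_Space property.
    private static final Pattern UNICODE_WS = Pattern.compile("\\p{IsWhite_Space}");

    // Default \s (no UNICODE_CHARACTER_CLASS flag): [ \t\n\x0B\f\r] only.
    private static final Pattern REGEX_S = Pattern.compile("\\s");

    private WhiteSpaceCheck() {
    }

    static boolean unicodeWhiteSpace(int codePoint) {
        return UNICODE_WS.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean regexS(int codePoint) {
        return REGEX_S.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, FIGURE SPACE, FILE SEPARATOR, NEXT LINE
        int[] samples = {0x0020, 0x00A0, 0x2007, 0x001C, 0x0085};
        for (int cp : samples) {
            System.out.printf("U+%04X isWhitespace=%b White_Space=%b regex\\s=%b%n",
                    cp, Character.isWhitespace(cp), unicodeWhiteSpace(cp), regexS(cp));
        }
    }
}
```

   The two predicates differ in both directions: U+00A0 NO-BREAK SPACE has the
   White_Space property but is rejected by Character.isWhitespace, while
   U+001C FILE SEPARATOR is accepted by Character.isWhitespace but does not
   have the property. That is why the wording should be scoped to the
   normalization package rather than to the library as a whole.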



##########
opennlp-docs/src/docbkx/normalizer.xml:
##########
@@ -0,0 +1,586 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+
+<chapter xml:id="tools.normalizer">
+
+       <title>Text Normalization</title>
+
+       <section xml:id="tools.normalizer.introduction">
+               <title>Introduction</title>
+               <para>
+                       The package <code>opennlp.tools.util.normalizer</code> provides Unicode-aware text
+                       normalization for matching, search, and tokenization preprocessing. It cleans up the
+                       kinds of inconsistency that real text carries when it is copied from the web, PDFs,
+                       office documents, or multilingual sources: spacing that is not an ordinary space, the
+                       many dash and quotation variants, decomposed versus precomposed accents, non-ASCII
+                       digits, and invisible control characters.
+               </para>
+               <para>
+                       The implementation follows three principles:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
+                                       Unicode Character Database (for example the <code>White_Space</code> and
+                                       <code>Dash</code> properties), not from the JVM's locale-dependent or quirky
+                                       character predicates. The library never relies on
+                                       <code>Character.isWhitespace</code>, which disagrees with the Unicode standard.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Cursor-based, no regular expressions.</emphasis> Every
+                                       operation is a single forward pass over the input that tests membership in O(1)
+                                       and advances by code point. This avoids the allocation and the catastrophic
+                                       backtracking (ReDoS) risk of regular expressions, and it correctly recognizes
+                                       Unicode characters that Java's <code>\s</code> does not.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Offset-preserving.</emphasis> The original text is always
+                                       the source of truth. Normalization produces a derived form for matching while the
+                                       original character offsets are kept, so a search hit can be reported and
+                                       highlighted against the source even when the normalized form has a different
+                                       length.
+                               </para>
+                       </listitem>
+               </itemizedlist>
+               <para>
+                       Two engines underpin everything: the <code>CharSequenceNormalizer</code> family offers
+                       ready-made, composable normalizers, and the <code>CharClass</code> engine is the low-level,
+                       configurable building block they are made of. Built on these are three higher-level
+                       features documented below: a layered term model that projects a token through a
+                       configurable stack of transforms while keeping every intermediate form (see
+                       <xref linkend="tools.normalizer.term"/>), per-language profiles that select the transforms
+                       appropriate to a language (see <xref linkend="tools.normalizer.language"/>), and confusable
+                       folding that reduces lookalike characters for matching (see
+                       <xref linkend="tools.normalizer.confusables"/>).
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.normalizers">
+               <title>The normalizer family</title>
+               <para>
+                       Each normalizer implements the existing
+                       <code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
+                       (<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton
+                       obtained through <code>getInstance()</code>. They can therefore be combined with the
+                       existing <code>AggregateCharSequenceNormalizer</code>, or with the
+                       <code>TextNormalizer</code> builder described below.

Review Comment:
   This section states that each normalizer is a singleton obtained via 
getInstance(), but several normalizers are configurable (e.g., 
CaseFoldCharSequenceNormalizer(Locale), AccentFoldCharSequenceNormalizer(Set, 
boolean)). To keep the docs accurate, describe getInstance() as providing the 
default singleton configuration, with optional constructors/overloads for 
custom behavior.
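
   The pattern the docs should describe can be sketched as follows. This is
   only an illustration under stated assumptions: SketchNormalizer and
   CaseFoldSketch are hypothetical names, not the classes in this PR, and
   plain lowercasing stands in for real case folding. It shows getInstance()
   returning a shared default configuration while a constructor accepts a
   custom one.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for the CharSequenceNormalizer interface.
interface SketchNormalizer {
    CharSequence normalize(CharSequence text);
}

// Hypothetical normalizer: default singleton plus a configurable constructor.
final class CaseFoldSketch implements SketchNormalizer {

    // Shared, stateless default configuration (locale-neutral).
    private static final CaseFoldSketch DEFAULT = new CaseFoldSketch(Locale.ROOT);

    private final Locale locale;

    // Custom configuration: callers who need locale-specific rules construct their own.
    CaseFoldSketch(Locale locale) {
        this.locale = Objects.requireNonNull(locale, "locale");
    }

    // Returns the default singleton, never a new instance.
    static CaseFoldSketch getInstance() {
        return DEFAULT;
    }

    @Override
    public CharSequence normalize(CharSequence text) {
        // Simplification: lowercasing is used here in place of full case folding.
        return text.toString().toLowerCase(locale);
    }

    public static void main(String[] args) {
        System.out.println(getInstance().normalize("TITLE"));
        System.out.println(new CaseFoldSketch(Locale.forLanguageTag("tr")).normalize("TITLE"));
    }
}
```

   With the default instance "TITLE" becomes "title"; with a Turkish locale
   the capital I maps to dotless ı (U+0131), which is the kind of difference
   that makes "always a singleton" an inaccurate description for the
   configurable normalizers.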



##########
opennlp-docs/src/docbkx/tokenizer.xml:
##########
@@ -443,4 +452,84 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
                        </para>
                </section>
        </section>
+
+       <section xml:id="tools.tokenizer.uax29">
+               <title>Unicode Word Segmentation (UAX #29)</title>
+               <para>
+                       The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the
+                       Unicode Text Segmentation algorithm
+                       (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+                       rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+                       a <code>CharSequence</code>, and it reports character offsets so the original text is
+                       preserved for downstream processing such as the normalization described in
+                       <xref linkend="tools.normalizer"/>. The boundary data comes from the bundled Unicode
+                       Character Database (currently Unicode 17.0) and the implementation passes the official
+                       <code>WordBreakTest</code> conformance suite for that release.

Review Comment:
   The codebase (and Unicode) refer to the conformance file as 
"WordBreakTest.txt". Using the full file name here makes the reference 
unambiguous and matches the bundled test resource name.
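
   For readers unfamiliar with the file, each data line in WordBreakTest.txt
   alternates boundary markers with hexadecimal code points: ÷ (U+00F7) marks
   an expected break, × (U+00D7) marks no break, and everything after # is a
   comment. A rough sketch of a line parser follows. The class name
   WordBreakTestLine is hypothetical and is not the test harness used in this
   PR, and the sample line in main is written in the file's format for
   illustration rather than copied from it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for one data line of WordBreakTest.txt.
final class WordBreakTestLine {

    private static final String BREAK = "\u00F7";     // ÷ : boundary expected here
    private static final String NO_BREAK = "\u00D7";  // × : no boundary here

    final String text;           // the sample text encoded by the line
    final List<Integer> breaks;  // expected boundaries as UTF-16 offsets into text

    private WordBreakTestLine(String text, List<Integer> breaks) {
        this.text = text;
        this.breaks = breaks;
    }

    static WordBreakTestLine parse(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            throw new IllegalArgumentException("not a data line: " + line);
        }
        StringBuilder sb = new StringBuilder();
        List<Integer> breaks = new ArrayList<>();
        for (String token : data.split("\\s+")) {
            if (token.equals(BREAK)) {
                breaks.add(sb.length());   // boundary before the next code point
            } else if (!token.equals(NO_BREAK)) {
                sb.appendCodePoint(Integer.parseInt(token, 16));
            }
        }
        return new WordBreakTestLine(sb.toString(), breaks);
    }

    public static void main(String[] args) {
        WordBreakTestLine t = parse("\u00F7 0061 \u00D7 0062 \u00F7 0020 \u00F7\t# illustrative line");
        System.out.println("[" + t.text + "] " + t.breaks);
    }
}
```

   A conformance check would then compare the parsed break offsets with the
   offsets the tokenizer reports for the same text.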



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]