Copilot commented on code in PR #1106: URL: https://github.com/apache/opennlp/pull/1106#discussion_r3449575431
########## opennlp-docs/src/docbkx/normalizer.xml: ########## @@ -0,0 +1,586 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN" +"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[ +]> +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor + license agreements. See the NOTICE file distributed with this work for additional + information regarding copyright ownership. The ASF licenses this file to + you under the Apache License, Version 2.0 (the "License"); you may not use + this file except in compliance with the License. You may obtain a copy of + the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required + by applicable law or agreed to in writing, software distributed under the + License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS + OF ANY KIND, either express or implied. See the License for the specific + language governing permissions and limitations under the License. --> + +<chapter xml:id="tools.normalizer"> + + <title>Text Normalization</title> + + <section xml:id="tools.normalizer.introduction"> + <title>Introduction</title> + <para> + The package <code>opennlp.tools.util.normalizer</code> provides Unicode-aware text + normalization for matching, search, and tokenization preprocessing. It cleans up the + kinds of inconsistency that real text carries when it is copied from the web, PDFs, + office documents, or multilingual sources: spacing that is not an ordinary space, the + many dash and quotation variants, decomposed versus precomposed accents, non-ASCII + digits, and invisible control characters. 
+ </para> + <para> + The implementation follows three principles: + </para> + <itemizedlist> + <listitem> + <para> + <emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the + Unicode Character Database (for example the <code>White_Space</code> and + <code>Dash</code> properties), not from the JVM's locale-dependent or quirky + character predicates. The library never relies on + <code>Character.isWhitespace</code>, which disagrees with the Unicode standard. Review Comment: The phrasing "The library never relies on Character.isWhitespace" is ambiguous and can be read as applying to OpenNLP as a whole (which does use Character.isWhitespace elsewhere, e.g. StringUtil). Consider tightening this to explicitly refer to the normalization package/engine to avoid a misleading claim. ########## opennlp-docs/src/docbkx/normalizer.xml: ########## @@ -0,0 +1,586 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN" +"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[ +]> +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor + license agreements. See the NOTICE file distributed with this work for additional + information regarding copyright ownership. The ASF licenses this file to + you under the Apache License, Version 2.0 (the "License"); you may not use + this file except in compliance with the License. You may obtain a copy of + the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required + by applicable law or agreed to in writing, software distributed under the + License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS + OF ANY KIND, either express or implied. See the License for the specific + language governing permissions and limitations under the License. 
--> + +<chapter xml:id="tools.normalizer"> + + <title>Text Normalization</title> + + <section xml:id="tools.normalizer.introduction"> + <title>Introduction</title> + <para> + The package <code>opennlp.tools.util.normalizer</code> provides Unicode-aware text + normalization for matching, search, and tokenization preprocessing. It cleans up the + kinds of inconsistency that real text carries when it is copied from the web, PDFs, + office documents, or multilingual sources: spacing that is not an ordinary space, the + many dash and quotation variants, decomposed versus precomposed accents, non-ASCII + digits, and invisible control characters. + </para> + <para> + The implementation follows three principles: + </para> + <itemizedlist> + <listitem> + <para> + <emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the + Unicode Character Database (for example the <code>White_Space</code> and + <code>Dash</code> properties), not from the JVM's locale-dependent or quirky + character predicates. The library never relies on + <code>Character.isWhitespace</code>, which disagrees with the Unicode standard. + </para> + </listitem> + <listitem> + <para> + <emphasis role="bold">Cursor-based, no regular expressions.</emphasis> Every + operation is a single forward pass over the input that tests membership in O(1) + and advances by code point. This avoids the allocation and the catastrophic + backtracking (ReDoS) risk of regular expressions, and it correctly recognizes + Unicode characters that Java's <code>\s</code> does not. + </para> + </listitem> + <listitem> + <para> + <emphasis role="bold">Offset-preserving.</emphasis> The original text is always + the source of truth. Normalization produces a derived form for matching while the + original character offsets are kept, so a search hit can be reported and + highlighted against the source even when the normalized form has a different + length. 
+ </para> + </listitem> + </itemizedlist> + <para> + Two engines underpin everything: the <code>CharSequenceNormalizer</code> family offers + ready-made, composable normalizers, and the <code>CharClass</code> engine is the low-level, + configurable building block they are made of. Built on these are three higher-level + features documented below: a layered term model that projects a token through a + configurable stack of transforms while keeping every intermediate form (see + <xref linkend="tools.normalizer.term"/>), per-language profiles that select the transforms + appropriate to a language (see <xref linkend="tools.normalizer.language"/>), and confusable + folding that reduces lookalike characters for matching (see + <xref linkend="tools.normalizer.confusables"/>). + </para> + </section> + + <section xml:id="tools.normalizer.normalizers"> + <title>The normalizer family</title> + <para> + Each normalizer implements the existing + <code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface + (<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton + obtained through <code>getInstance()</code>. They can therefore be combined with the + existing <code>AggregateCharSequenceNormalizer</code>, or with the + <code>TextNormalizer</code> builder described below. Review Comment: This section states that each normalizer is a singleton obtained via getInstance(), but several normalizers are configurable (e.g., CaseFoldCharSequenceNormalizer(Locale), AccentFoldCharSequenceNormalizer(Set, boolean)). To keep the docs accurate, describe getInstance() as providing the default singleton configuration, with optional constructors/overloads for custom behavior. 
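For illustration, the pattern this comment asks the docs to describe (a shared default instance from `getInstance()`, plus a constructor for custom configuration) might look like the sketch below. `SketchNormalizer` and `CaseFoldSketch` are hypothetical stand-ins written for this note, not the classes in the PR, and plain lowercasing is used as a simplification of real case folding.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for the CharSequenceNormalizer interface quoted in the diff.
interface SketchNormalizer {
  CharSequence normalize(CharSequence text);
}

// Hypothetical class, not the PR's CaseFoldCharSequenceNormalizer.
final class CaseFoldSketch implements SketchNormalizer {

  // Shared default configuration; safe to share because the class is immutable.
  private static final CaseFoldSketch DEFAULT = new CaseFoldSketch(Locale.ROOT);

  private final Locale locale;

  // Constructor for callers that need a non-default configuration.
  CaseFoldSketch(Locale locale) {
    this.locale = Objects.requireNonNull(locale, "locale");
  }

  // Returns the shared default instance.
  static CaseFoldSketch getInstance() {
    return DEFAULT;
  }

  @Override
  public CharSequence normalize(CharSequence text) {
    // Simplification: locale-aware lowercasing stands in for real case folding.
    return text.toString().toLowerCase(locale);
  }
}
```

With a Turkish locale, `new CaseFoldSketch(Locale.forLanguageTag("tr"))` lowercases `I` to the dotless `ı`, which is the kind of difference a locale-taking constructor exists for, while `getInstance()` keeps returning the same default object.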
########## opennlp-docs/src/docbkx/tokenizer.xml: ########## @@ -443,4 +452,84 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations) </para> </section> </section> + + <section xml:id="tools.tokenizer.uax29"> + <title>Unicode Word Segmentation (UAX #29)</title> + <para> + The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the + Unicode Text Segmentation algorithm + (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary + rules WB1 through WB999. It is rule based and needs no trained model, it works directly over + a <code>CharSequence</code>, and it reports character offsets so the original text is + preserved for downstream processing such as the normalization described in + <xref linkend="tools.normalizer"/>. The boundary data comes from the bundled Unicode + Character Database (currently Unicode 17.0) and the implementation passes the official + <code>WordBreakTest</code> conformance suite for that release. Review Comment: The codebase (and Unicode) refer to the conformance file as "WordBreakTest.txt". Using the full file name here makes the reference unambiguous and matches the bundled test resource name.
