Re: [PR] OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) (opennlp)

via GitHub Tue, 23 Jun 2026 03:55:13 -0700


Copilot commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3459117254



##########
opennlp-docs/src/docbkx/tokenizer.xml:
##########
@@ -443,4 +452,86 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
                        </para>
                </section>
        </section>
+
+       <section xml:id="tools.tokenizer.uax29">
+               <title>Unicode Word Segmentation (UAX #29)</title>
+               <para>
+                       The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the
+                       Unicode Text Segmentation algorithm
+                       (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+                       rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+                       a <code>CharSequence</code>, and it reports character offsets so the original text is

Review Comment:
   Hyphenate 'rule based' as 'rule-based', and consider splitting the comma
   splice into two sentences for readability.
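
   To make the "character offsets over a CharSequence" idea in the quoted
   paragraph concrete, here is a minimal sketch. It deliberately uses the
   JDK's `java.text.BreakIterator` rather than the
   `opennlp.tools.tokenize.uax29` package, whose API is not shown in this
   hunk. The JDK iterator is a separate implementation and may not match
   UAX #29 rule for rule. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT the OpenNLP uax29 tokenizer.
class WordOffsetsSketch {

  // Returns {start, end} character offsets of word-like segments in the text.
  static List<int[]> wordOffsets(CharSequence text) {
    String s = text.toString();
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(s);
    List<int[]> offsets = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      // Keep segments that begin with a letter or digit; skip spaces and punctuation.
      if (Character.isLetterOrDigit(s.codePointAt(start))) {
        offsets.add(new int[] {start, end});
      }
    }
    return offsets;
  }

  public static void main(String[] args) {
    String text = "Hello, world!";
    for (int[] span : wordOffsets(text)) {
      // Offsets index into the unmodified input, so substring recovers each token.
      System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
    }
  }
}
```

   Because the offsets refer to the unmodified input, a caller can always
   recover the exact source text of each token with `substring`.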



##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -155,13 +155,58 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
                                        <programlisting language="java">
 <![CDATA[File model = new File("/path/to/model.onnx");
 File vocab = new File("/path/to/vocab.txt");
-Map<Integer, String> categories = new HashMap<>();
-String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
-Span[] spans = nameFinderDL.find(tokens);]]>
+// Maps the model's output indices to its BIO labels, e.g. "O", "B-PER", "I-PER".
+Map<Integer, String> ids2Labels = new HashMap<>();
+SentenceDetector sentenceDetector =
+    new SentenceDetectorME(new SentenceModel(new File("/path/to/en-sent.bin")));
+String[] tokens = {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, ids2Labels, sentenceDetector);
+// findInOriginal returns spans in the original input's coordinates.
+Span[] spans = nameFinderDL.findInOriginal(tokens);]]>

Review Comment:
   The example states `ids2Labels` maps model output indices to BIO labels, but 
then leaves the map empty. If the runtime expects actual labels, this snippet 
can fail or produce unusable output. Update the example to show how 
`ids2Labels` is populated (e.g., reading from a labels file, or calling a 
helper that returns a filled map), or explicitly mark it as a placeholder and 
show at least one concrete mapping entry.
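
   A minimal sketch of what a populated map could look like is given here.
   The index-to-label assignments are illustrative assumptions and are not
   taken from any particular model. The real values must match the label
   order in the model's own configuration. The class and method names are
   hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the indices below are placeholders, not real model output ids.
class Ids2LabelsSketch {

  // Builds an index-to-BIO-label map for a person-only tagging scheme.
  static Map<Integer, String> buildIds2Labels() {
    Map<Integer, String> ids2Labels = new HashMap<>();
    ids2Labels.put(0, "O");      // token outside any entity (assumed index)
    ids2Labels.put(1, "B-PER");  // first token of a person name (assumed index)
    ids2Labels.put(2, "I-PER");  // continuation of a person name (assumed index)
    return ids2Labels;
  }

  public static void main(String[] args) {
    Map<Integer, String> ids2Labels = buildIds2Labels();
    System.out.println(ids2Labels);
  }
}
```

   In the documentation snippet, a map built this way would take the place
   of the empty `new HashMap<>()`.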



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
