Re: [PR] OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) (opennlp)

via GitHub Thu, 25 Jun 2026 02:20:00 -0700


krickert commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3473240544



##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -155,13 +155,58 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
                                        <programlisting language="java">
 <![CDATA[File model = new File("/path/to/model.onnx");
 File vocab = new File("/path/to/vocab.txt");
-Map<Integer, String> categories = new HashMap<>();
-String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
-Span[] spans = nameFinderDL.find(tokens);]]>
+// Maps the model's output indices to its BIO labels, e.g. "O", "B-PER", "I-PER".
+Map<Integer, String> ids2Labels = new HashMap<>();
+SentenceDetector sentenceDetector =
+    new SentenceDetectorME(new SentenceModel(new File("/path/to/en-sent.bin")));
+String[] tokens = {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, ids2Labels, sentenceDetector);
+// findInOriginal returns spans in the original input's coordinates.
+Span[] spans = nameFinderDL.findInOriginal(tokens);]]>

Review Comment:
   Done — the example now populates `ids2Labels` with a concrete, exhaustive 
BIO map instead of an empty one.
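
   For readers following the thread, here is a minimal sketch of what a populated map could look like. The label set and index order are assumptions for illustration only; the real mapping has to match the configuration of the model being loaded. The class name `Ids2LabelsSketch` and the helper `buildIds2Labels` are hypothetical and are not part of OpenNLP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Ids2LabelsSketch {

    // Hypothetical helper: builds an index-to-label map for a BIO-tagged NER model.
    // The indices and labels below are illustrative assumptions, not taken from a real model.
    static Map<Integer, String> buildIds2Labels() {
        Map<Integer, String> ids2Labels = new LinkedHashMap<>();
        ids2Labels.put(0, "O");      // outside any entity
        ids2Labels.put(1, "B-PER");  // beginning of a person name
        ids2Labels.put(2, "I-PER");  // inside a person name
        ids2Labels.put(3, "B-ORG");  // beginning of an organization name
        ids2Labels.put(4, "I-ORG");  // inside an organization name
        ids2Labels.put(5, "B-LOC");  // beginning of a location name
        ids2Labels.put(6, "I-LOC");  // inside a location name
        return ids2Labels;
    }

    public static void main(String[] args) {
        // Print the mapping in insertion order.
        buildIds2Labels().forEach((id, label) -> System.out.println(id + " -> " + label));
    }
}
```

   A map like this would then take the place of the empty `ids2Labels` passed to the constructor in the hunk above.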



##########
opennlp-docs/src/docbkx/tokenizer.xml:
##########
@@ -443,4 +452,86 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
                        </para>
                </section>
        </section>
+
+       <section xml:id="tools.tokenizer.uax29">
+               <title>Unicode Word Segmentation (UAX #29)</title>
+               <para>
+                       The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the
+                       Unicode Text Segmentation algorithm
+                       (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+                       rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+                       a <code>CharSequence</code>, and it reports character offsets so the original text is
Review Comment:
   Done — hyphenated to 'rule-based' and split the comma-splice.
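
   To illustrate the idea of model-free word segmentation that reports character offsets, here is a small sketch. It uses the JDK's `java.text.BreakIterator` as a stand-in, not the `opennlp.tools.tokenize.uax29` classes, whose API is not shown in this thread. The JDK's word rules may differ in detail from the UAX #29 rules the new tokenizer implements. `WordBoundarySketch` and `wordSpans` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {

    // Returns [start, end) character offsets of word-like segments in the text.
    static List<int[]> wordSpans(CharSequence text) {
        String s = text.toString();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(s);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Keep only segments containing at least one letter or digit,
            // which drops whitespace-only and punctuation-only segments.
            if (s.substring(start, end).codePoints().anyMatch(Character::isLetterOrDigit)) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        for (int[] span : wordSpans(text)) {
            // Offsets index into the original text, so each token can be recovered from it.
            System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

   Because the spans are offsets into the unmodified input, a caller can map every token back to its exact position in the original text.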



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
