Copilot commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3459117254
##########
opennlp-docs/src/docbkx/tokenizer.xml:
##########
@@ -443,4 +452,86 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
</para>
</section>
</section>
+
+ <section xml:id="tools.tokenizer.uax29">
+ <title>Unicode Word Segmentation (UAX #29)</title>
+ <para>
+ The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the
+ Unicode Text Segmentation algorithm
+ (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+ rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+ a <code>CharSequence</code>, and it reports character offsets so the original text is
Review Comment:
Hyphenate 'rule based' to 'rule-based' (and consider splitting the
comma-splice into two sentences for readability).
##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -155,13 +155,58 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
<programlisting language="java">
<![CDATA[File model = new File("/path/to/model.onnx");
File vocab = new File("/path/to/vocab.txt");
-Map<Integer, String> categories = new HashMap<>();
-String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
-Span[] spans = nameFinderDL.find(tokens);]]>
+// Maps the model's output indices to its BIO labels, e.g. "O", "B-PER", "I-PER".
+Map<Integer, String> ids2Labels = new HashMap<>();
+SentenceDetector sentenceDetector =
+ new SentenceDetectorME(new SentenceModel(new File("/path/to/en-sent.bin")));
+String[] tokens = {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, ids2Labels, sentenceDetector);
+// findInOriginal returns spans in the original input's coordinates.
+Span[] spans = nameFinderDL.findInOriginal(tokens);]]>
Review Comment:
The example states `ids2Labels` maps model output indices to BIO labels, but
then leaves the map empty. If the runtime expects actual labels, this snippet
can fail or produce unusable output. Update the example to show how
`ids2Labels` is populated (e.g., reading from a labels file, or calling a
helper that returns a filled map), or explicitly mark it as a placeholder and
show at least one concrete mapping entry.
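
A minimal sketch of what a populated map could look like, using only the JDK. The helper name `buildIds2Labels` and the label order `"O"`, `"B-PER"`, `"I-PER"` are assumptions for illustration; the real indices and labels have to come from the model's own configuration, not from this snippet.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Ids2LabelsExample {

    // Hypothetical helper (not part of OpenNLP): maps each label's position
    // in the given list to the label itself, so index 0 -> first label, etc.
    static Map<Integer, String> buildIds2Labels(List<String> labelsInModelOrder) {
        Map<Integer, String> ids2Labels = new HashMap<>();
        for (int i = 0; i < labelsInModelOrder.size(); i++) {
            ids2Labels.put(i, labelsInModelOrder.get(i));
        }
        return ids2Labels;
    }

    public static void main(String[] args) {
        // Assumed label order, for illustration only. It must match the
        // output indices of the model that is actually loaded.
        Map<Integer, String> ids2Labels = buildIds2Labels(List.of("O", "B-PER", "I-PER"));
        System.out.println(ids2Labels);
    }
}
```

In the documentation example, a map built this way would replace the empty `new HashMap<>()` before it is passed to the `NameFinderDL` constructor shown in the diff.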