krickert commented on code in PR #1106:
URL: https://github.com/apache/opennlp/pull/1106#discussion_r3473240544
##########
opennlp-docs/src/docbkx/namefinder.xml:
##########
@@ -155,13 +155,58 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
<programlisting language="java">
<![CDATA[File model = new File("/path/to/model.onnx");
File vocab = new File("/path/to/vocab.txt");
-Map<Integer, String> categories = new HashMap<>();
-String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
-Span[] spans = nameFinderDL.find(tokens);]]>
+// Maps the model's output indices to its BIO labels, e.g. "O", "B-PER", "I-PER".
+Map<Integer, String> ids2Labels = new HashMap<>();
+SentenceDetector sentenceDetector =
+ new SentenceDetectorME(new SentenceModel(new File("/path/to/en-sent.bin")));
+String[] tokens = {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, ids2Labels, sentenceDetector);
+// findInOriginal returns spans in the original input's coordinates.
+Span[] spans = nameFinderDL.findInOriginal(tokens);]]>
Review Comment:
Done — the example now populates `ids2Labels` with a concrete, exhaustive
BIO map instead of an empty one.
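
For readers of this thread, a populated map could look roughly like the sketch below. It is illustrative only: the class name `Ids2LabelsSketch`, the helper `buildIds2Labels`, and the index order (0, 1, 2) are assumptions, and only the three labels named in the code comment above are shown. A real map has to list every label the model emits, in the order given by that model's configuration.

```java
import java.util.HashMap;
import java.util.Map;

public class Ids2LabelsSketch {

  // Hypothetical helper: builds an index-to-label map for a PER-only BIO scheme.
  // The indices are assumed; they must match the output order of the actual model.
  static Map<Integer, String> buildIds2Labels() {
    Map<Integer, String> ids2Labels = new HashMap<>();
    ids2Labels.put(0, "O");      // token outside any entity
    ids2Labels.put(1, "B-PER");  // first token of a person name
    ids2Labels.put(2, "I-PER");  // continuation token of a person name
    return ids2Labels;
  }

  public static void main(String[] args) {
    // Print each index with its label.
    buildIds2Labels().forEach((id, label) -> System.out.println(id + " -> " + label));
  }
}
```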
##########
opennlp-docs/src/docbkx/tokenizer.xml:
##########
@@ -443,4 +452,86 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
</para>
</section>
</section>
+
+ <section xml:id="tools.tokenizer.uax29">
+ <title>Unicode Word Segmentation (UAX #29)</title>
+ <para>
+ The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the
+ Unicode Text Segmentation algorithm
+ (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+ rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+ a <code>CharSequence</code>, and it reports character offsets so the original text is
Review Comment:
Done — hyphenated to 'rule-based' and split the comma-splice.
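
To illustrate what "reports character offsets" means in practice, here is a rough sketch that does not use the new `opennlp.tools.tokenize.uax29` classes at all. It substitutes the JDK's `java.text.BreakIterator`, whose word-boundary rules are not guaranteed to match UAX #29 exactly. The class name `WordBoundarySketch` and the method `wordOffsets` are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {

  // Returns [start, end) character offsets of word-like segments in the input.
  // Segments without any letter or digit (spaces, punctuation) are skipped.
  static List<int[]> wordOffsets(CharSequence text) {
    String s = text.toString();
    BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
    boundaries.setText(s);

    List<int[]> offsets = new ArrayList<>();
    int start = boundaries.first();
    for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
      boolean wordLike = s.substring(start, end).codePoints().anyMatch(Character::isLetterOrDigit);
      if (wordLike) {
        offsets.add(new int[] {start, end});
      }
    }
    return offsets;
  }

  public static void main(String[] args) {
    String text = "George Washington was president.";
    for (int[] span : wordOffsets(text)) {
      // The offsets index into the original text, so each token can be recovered verbatim.
      System.out.println(span[0] + ".." + span[1] + " " + text.substring(span[0], span[1]));
    }
  }
}
```

Because each span indexes into the untouched input, no model or lookup table is needed to map tokens back to the source text.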