[
https://issues.apache.org/jira/browse/OPENNLP-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250853#comment-17250853
]
ASF GitHub Bot commented on OPENNLP-216:
----------------------------------------
kinow commented on a change in pull request #388:
URL: https://github.com/apache/opennlp/pull/388#discussion_r544862453
##########
File path: opennlp-docs/src/docbkx/tokenizer.xml
##########
@@ -396,19 +396,78 @@ test -> NO_OPERATION
<![CDATA[
He said "This is a test".]]>
</programlisting>
- TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
</para>
<section id="tools.tokenizer.detokenizing.api">
<title>Detokenizing API</title>
- <para>TODO: Write documentation about the detokenizer api. Any contributions
-are very welcome. If you want to contribute please contact us on the mailing list
-or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-216">OPENNLP-216</ulink>.</para>
+ <para>
+ The Detokenizer can be use to detokenize the tokens to String.
+ To instantiate the Detokenizer (a rule based detokenizer)
+ a DetokenizationDictionary (the rule of dictionary) must be created first.
+ The following code sample shows how a rule dictionary can be loaded.
+ <programlisting language="java">
+ <![CDATA[
+InputStream dictIn = new FileInputStream("latin-detokenizer.xml");
+DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);]]>
+ </programlisting>
+ After the rule dictionary is loadeed the DictionaryDetokenizer can be instantiated.
+ <programlisting language="java">
+ <![CDATA[
+Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
+ </programlisting>
+ The detokenizer offers two detokenize methods,the first detokenize the input tokens into a String.
+ <programlisting language="java">
+ <![CDATA[
+String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
+String sentence = detokenizer.detokenize(tokens, null);
+Assert.assertEquals("A co-worker helped.", sentence);]]>
+ </programlisting>
+ Tokens which are connected without a space inbetween can be spearated by a split marker.
Review comment:
s/inbetween/in-between?
##########
File path: opennlp-docs/src/docbkx/tokenizer.xml
##########
@@ -396,19 +396,78 @@ test -> NO_OPERATION
<![CDATA[
He said "This is a test".]]>
</programlisting>
- TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
</para>
<section id="tools.tokenizer.detokenizing.api">
<title>Detokenizing API</title>
- <para>TODO: Write documentation about the detokenizer api. Any contributions
-are very welcome. If you want to contribute please contact us on the mailing list
-or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-216">OPENNLP-216</ulink>.</para>
+ <para>
+ The Detokenizer can be use to detokenize the tokens to String.
Review comment:
s/The Detokenizer can be use/The Detokenizer can be used/
> Add Detokenizer API section
> ---------------------------
>
> Key: OPENNLP-216
> URL: https://issues.apache.org/jira/browse/OPENNLP-216
> Project: OpenNLP
> Issue Type: Improvement
> Components: Documentation
> Reporter: Jörn Kottmann
> Priority: Major
> Labels: help-wanted
>
> The documentation is lacking a section about the detokenizer API.