[opennlp] branch main updated: OPENNLP-1435 Clear typos from opennlp-docs module (#480)

jzemerick Tue, 03 Jan 2023 06:50:32 -0800

This is an automated email from the ASF dual-hosted git repository.

jzemerick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/opennlp.git



The following commit(s) were added to refs/heads/main by this push:
     new c671d607 OPENNLP-1435 Clear typos from opennlp-docs module (#480)
c671d607 is described below

commit c671d6075b531e0bdbff272187523c2c9eade75d
Author: Martin Wiesner <[email protected]>
AuthorDate: Tue Jan 3 15:50:21 2023 +0100

    OPENNLP-1435 Clear typos from opennlp-docs module (#480)
    
    - fixes typos in several docbkx files
    - switches references to http URLs, if available, to a secure form (https)
    - replaces an unreachable URL with a web-archived version of the original
---
 opennlp-docs/src/docbkx/chunker.xml          |  2 +-
 opennlp-docs/src/docbkx/cli.xml              | 20 ++++++++---------
 opennlp-docs/src/docbkx/corpora.xml          | 32 ++++++++++++++--------------
 opennlp-docs/src/docbkx/introduction.xml     |  6 +++---
 opennlp-docs/src/docbkx/langdetect.xml       | 10 ++++-----
 opennlp-docs/src/docbkx/lemmatizer.xml       |  8 +++----
 opennlp-docs/src/docbkx/machine-learning.xml |  4 ++--
 opennlp-docs/src/docbkx/morfologik-addon.xml | 12 +++++------
 opennlp-docs/src/docbkx/namefinder.xml       | 14 ++++++------
 opennlp-docs/src/docbkx/parser.xml           |  6 +++---
 opennlp-docs/src/docbkx/postagger.xml        |  8 +++----
 opennlp-docs/src/docbkx/sentdetect.xml       |  6 +++---
 opennlp-docs/src/docbkx/tokenizer.xml        | 10 ++++-----
 opennlp-docs/src/docbkx/uima-integration.xml | 28 ++++++++++++------------
 14 files changed, 83 insertions(+), 83 deletions(-)

diff --git a/opennlp-docs/src/docbkx/chunker.xml b/opennlp-docs/src/docbkx/chunker.xml
index 262f4734..5c65deac 100644
--- a/opennlp-docs/src/docbkx/chunker.xml
+++ b/opennlp-docs/src/docbkx/chunker.xml
@@ -150,7 +150,7 @@ Sequence topSequences[] = chunk.topKSequences(sent, pos);]]>
                </para>
                <para>
                The training data can be converted to the OpenNLP chunker training format,
-               which is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>.
+               which is based on <ulink url="https://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>.
         Other formats may also be available.
                The training data consist of three columns separated one single space. Each word has been put on a
                separate line and there is an empty line after each sentence. The first column contains
diff --git a/opennlp-docs/src/docbkx/cli.xml b/opennlp-docs/src/docbkx/cli.xml
index f809029a..adc6538d 100644
--- a/opennlp-docs/src/docbkx/cli.xml
+++ b/opennlp-docs/src/docbkx/cli.xml
@@ -32,7 +32,7 @@ under the License.
 
 <title>The Command Line Interface</title>
 
-<para>This section details the available tools and parameters of the Command Line Interface. For a introduction in its usage please refer to <xref linkend='intro.cli'/>.  </para>
+<para>This section details the available tools and parameters of the Command Line Interface. For an introduction in its usage please refer to <xref linkend='intro.cli'/>.  </para>
 
 <section id='tools.cli.doccat'>
 
@@ -92,7 +92,7 @@ Arguments description:
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>encoding</entry>
@@ -139,7 +139,7 @@ Arguments description:
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>encoding</entry>
@@ -197,7 +197,7 @@ Arguments description:
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>encoding</entry>
@@ -232,7 +232,7 @@ Usage: opennlp DoccatConverter help|leipzig [help|options...]
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>encoding</entry>
@@ -299,7 +299,7 @@ Arguments description:
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>sentencesPerSample</entry>
@@ -346,7 +346,7 @@ Usage: opennlp LanguageDetectorConverter help|leipzig [help|options...]
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>sentencesPerSample</entry>
@@ -410,7 +410,7 @@ Arguments description:
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>sentencesPerSample</entry>
@@ -469,7 +469,7 @@ Arguments description:
 <entry>sentencesDir</entry>
 <entry>sentencesDir</entry>
 <entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
 </row>
 <row>
 <entry>sentencesPerSample</entry>
@@ -3919,7 +3919,7 @@ Usage: opennlp ChunkerConverter help|ad [help|options...]
 Usage: opennlp Parser [-bs n -ap n -k n -tk tok_model] model < sentences 
 -bs n: Use a beam size of n.
 -ap f: Advance outcomes in with at least f% of the probability mass.
--k n: Show the top n parses.  This will also display their log-probablities.
+-k n: Show the top n parses.  This will also display their log-probabilities.
 -tk tok_model: Use the specified tokenizer model to tokenize the sentences. 
Defaults to a WhitespaceTokenizer.
 
 ]]>
diff --git a/opennlp-docs/src/docbkx/corpora.xml b/opennlp-docs/src/docbkx/corpora.xml
index 187c9c31..b21f61a6 100644
--- a/opennlp-docs/src/docbkx/corpora.xml
+++ b/opennlp-docs/src/docbkx/corpora.xml
@@ -144,13 +144,13 @@ F-Measure: 0.9230575441395671]]>
                <title>Getting the data</title>
                <para>The data consists of three files per language: one 
training file and two test files testa and testb.
                The first test file will be used in the development phase for 
finding good parameters for the learning system.
-               The second test file will be used for the final evaluation. 
Currently there are data files available for two languages:
+               The second test file will be used for the final evaluation. 
Currently, there are data files available for two languages:
                Spanish and Dutch.
                </para>
                <para>
                The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are
-               from May 2000. The annotation was carried out by the <ulink url="http://www.talp.cat/">TALP Research Center</ulink> of the Technical University of Catalonia (UPC)
-               and the <ulink url="http://clic.ub.edu/">Center of Language and Computation (CLiC)</ulink>of the University of Barcelona (UB), and funded by the European Commission
+               from May 2000. The annotation was carried out by the <ulink url="https://www.talp.cat/">TALP Research Center</ulink> of the Technical University of Catalonia (UPC)
+               and the <ulink url="https://web.archive.org/web/20220516042208/http://clic.ub.edu/">Center of Language and Computation (CLiC)</ulink>of the University of Barcelona (UB), and funded by the European Commission
                through the NAMIC project (IST-1999-12392). 
                </para>
                <para>
@@ -159,12 +159,12 @@ F-Measure: 0.9230575441395671]]>
                </para>
                <para>
                You can find the Spanish files here: 
-               <ulink url="http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>
+               <ulink url="https://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">https://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>
                You must download esp.train.gz, unzip it and you will see the file esp.train.
                </para>
                <para>
                You can find the Dutch files here: 
-               <ulink url="http://www.cnts.ua.ac.be/conll2002/ner.tgz">http://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>
+               <ulink url="https://www.cnts.ua.ac.be/conll2002/ner.tgz">https://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>
                You must unzip it and go to /ner/data/ned.train.gz, so you unzip it too, and you will see the file ned.train.
                </para>
                </section>
@@ -260,7 +260,7 @@ path: .\es_ner_person.bin]]>
                <para>
                The English data is the Reuters Corpus, which is a collection of news wire articles.
                The Reuters Corpus can be obtained free of charges from the NIST for research
-               purposes: <ulink url="http://trec.nist.gov/data/reuters/reuters.html">http://trec.nist.gov/data/reuters/reuters.html</ulink>
+               purposes: <ulink url="https://trec.nist.gov/data/reuters/reuters.html">https://trec.nist.gov/data/reuters/reuters.html</ulink>
                </para>
                <para>
                The German data is a collection of articles from the German 
newspaper Frankfurter
@@ -387,16 +387,16 @@ F-Measure: 0.8267557582133971]]>
        <section id="tools.corpora.arvores-deitadas">
                <title>Arvores Deitadas</title>
                <para>
-               The Portuguese corpora available at <ulink url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from AD format to native format.  
+               The Portuguese corpora available at <ulink url="https://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from AD format to native format.
                </para>         
                <section id="tools.corpora.arvores-deitadas.getting">
                        <title>Getting the data</title>
                        <para>
-                       The Corpus can be downloaded from here: <ulink url="http://www.linguateca.pt/floresta/corpus.html">http://www.linguateca.pt/floresta/corpus.html</ulink>
+                       The Corpus can be downloaded from here: <ulink url="https://www.linguateca.pt/floresta/corpus.html">https://www.linguateca.pt/floresta/corpus.html</ulink>
                        </para>
                        <para>
-                       The Name Finder models were trained using the Amazonia corpus: <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
-                       The Chunker models were trained using the <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
+                       The Name Finder models were trained using the Amazonia corpus: <ulink url="https://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
+                       The Chunker models were trained using the <ulink url="https://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
                        </para>
                </section>
                
@@ -474,15 +474,15 @@ F-Measure: 0.7717879983140168]]>
                Penn Treebank for syntax and the Penn PropBank for 
predicate-argument
                structure. Its semantic representation will include word sense
                disambiguation for nouns and verbs, with each word sense 
connected to
-               an ontology, and coreference. The current goals call for 
annotation of
+               an ontology, and co-reference. The current goals call for 
annotation of
                over a million words each of English and Chinese, and half a 
million
-               words of Arabic over five years." 
(http://catalog.ldc.upenn.edu/LDC2011T03)
+               words of Arabic over five years." 
(https://catalog.ldc.upenn.edu/LDC2011T03)
        </para>
                <section id="tools.corpora.ontonotes.namefinder">
                <title>Name Finder Training</title>
        <para>
                The OntoNotes corpus can be used to train the Name Finder. The 
corpus
-               contains many different name types
+               contains different name types
                to train a model for a specific type only the built-in type 
filter
                option should be used.
        </para>
@@ -535,11 +535,11 @@ path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]>
                        OpenNLP can directly be trained and evaluated on 
labeled data in the brat format.
                        Instructions on how to use, download and install brat 
can be found on the project website:
 
-                       <ulink url="http://brat.nlplab.org">http://brat.nlplab.org</ulink>
+                       <ulink url="https://brat.nlplab.org">https://brat.nlplab.org</ulink>
 
                        Configuration of brat, including setting up the different entities and relations can be found at:
 
-                       <ulink url="http://brat.nlplab.org/configuration.html">http://brat.nlplab.org/configuration.html</ulink>
+                       <ulink url="https://brat.nlplab.org/configuration.html">https://brat.nlplab.org/configuration.html</ulink>
 
                </para>
 
@@ -548,7 +548,7 @@ path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]>
                        <title>Sentences and Tokens</title>
                        <para>
                                The brat annotation tool only adds named entity 
spans to the data and doesn't provide information
-                               about tokens and sentences. To train the name 
finder this information is required. By default it
+                               about tokens and sentences. To train the name 
finder this information is required. By default, it
                                is assumed that each line is a sentence and 
that tokens are whitespace separated. This can be
                                adjusted by providing a custom sentence 
detector and optional also a tokenizer.
 
diff --git a/opennlp-docs/src/docbkx/introduction.xml b/opennlp-docs/src/docbkx/introduction.xml
index 484e5b08..16acbbeb 100644
--- a/opennlp-docs/src/docbkx/introduction.xml
+++ b/opennlp-docs/src/docbkx/introduction.xml
@@ -34,7 +34,7 @@ under the License.
         </para>
 
         <para>
-        The goal of the OpenNLP project will be to create a mature toolkit for 
the abovementioned tasks.
+        The goal of the OpenNLP project will be to create a mature toolkit for 
the aforementioned tasks.
         An additional goal is to provide a large number of pre-built models 
for a variety of languages, as
         well as the annotated text resources that those models are derived 
from.
         </para>
@@ -306,7 +306,7 @@ $ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -
                 documentation we will refer to these models as "OpenNLP models." All NLP
                 components of OpenNLP support this type of model. The sections below in
                 this documentation describe how to train and use these models. <ulink url="https://opennlp.apache.org/models.html">Pre-trained
-                models</ulink> are available for some languages and some of the OpenNLP components.
+                models</ulink> are available for some languages and some OpenNLP components.
             </para>
         </section>
         <section id="intro.models.onnx">
@@ -318,7 +318,7 @@ $ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -
                 each of the OpenNLP components that supports ONNX models describes how to
                 use ONNX models for inference. Note that OpenNLP does not support training
                 models that can be used by the ONNX Runtime - ONNX models must be created
-                outside of OpenNLP using other tools.
+                outside OpenNLP using other tools.
             </para>
         </section>
     </section>
diff --git a/opennlp-docs/src/docbkx/langdetect.xml b/opennlp-docs/src/docbkx/langdetect.xml
index aef1fd41..3025d5e5 100644
--- a/opennlp-docs/src/docbkx/langdetect.xml
+++ b/opennlp-docs/src/docbkx/langdetect.xml
@@ -27,7 +27,7 @@ under the License.
                <title>Classifying</title>
                <para>
                The OpenNLP Language Detector classifies a document in 
ISO-639-3 languages according to the model capabilities.
-               A model can be trained with Maxent, Perceptron or Naive Bayes 
algorithms. By default normalizes a text and
+               A model can be trained with Maxent, Perceptron or Naive Bayes 
algorithms. By default, normalizes a text and
                        the context generator extracts n-grams of size 1, 2 and 
3. The n-gram sizes, the normalization and the
                        context generator can be customized by extending the 
LanguageDetectorFactory.
 
@@ -57,7 +57,7 @@ under the License.
                                                </row>
                                                <row>
                                                        
<entry>TwitterCharSequenceNormalizer</entry>
-                                                       <entry>Replaces 
hashtags and Twitter user names by blank spaces.</entry>
+                                                       <entry>Replaces 
hashtags and Twitter usernames by blank spaces.</entry>
                                                </row>
                                                <row>
                                                        
<entry>NumberCharSequenceNormalizer</entry>
@@ -160,8 +160,8 @@ $ bin/opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params params
                <section id="tools.langdetect.training.leipzig">
                        <title>Training with Leipzig</title>
                        <para>
-                               The Leipzig Corpora collection presents corpora 
in different languages. The corpora is a collection
-                               of individual sentences collected from the web 
and newspapers. The Corpora is available as plain text
+                               The Leipzig Corpora collection presents corpora 
in different languages. The corpora are a collection
+                               of individual sentences collected from the web 
and newspapers. The Corpora are available as plain text
                                and as MySQL database tables. The OpenNLP 
integration can only use the plain text version.
                                The     individual plain text packages can be 
downloaded here:
                                <ulink url="https://wortschatz.uni-leipzig.de/en/download">https://wortschatz.uni-leipzig.de/en/download</ulink>
@@ -184,7 +184,7 @@ $ bin/opennlp LanguageDetectorTrainer.leipzig -model modelFile [-params paramsFi
                        <para>
                                The following sequence of commands shows how to 
convert the Leipzig Corpora collection at folder
                                leipzig-train/ to the default Language Detector 
format, by creating groups of 5 sentences as documents
-                               and limiting to 10000 documents per language. 
Them, it shuffles the result and select the first
+                               and limiting to 10000 documents per language. 
Then, it shuffles the result and select the first
                                100000 lines as train corpus and the last 20000 
as evaluation corpus:
                                <screen>
                                        <![CDATA[
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml
index 8be62423..44356e04 100644
--- a/opennlp-docs/src/docbkx/lemmatizer.xml
+++ b/opennlp-docs/src/docbkx/lemmatizer.xml
@@ -25,7 +25,7 @@
                        postag of the word is required to find the lemma. For 
example, the form
                        `show' may refer
                        to either the verb "to show" or to the noun "show".
-                       Currently OpenNLP implement statistical and 
dictionary-based lemmatizers.
+                       Currently, OpenNLP implement statistical and 
dictionary-based lemmatizers.
                </para>
                <section id="tools.lemmatizer.tagging.cmdline">
                        <title>Lemmatizer Tool</title>
@@ -75,7 +75,7 @@ signed VBD sign
                        <title>Lemmatizer API</title>
                        <para>
                                The Lemmatizer can be embedded into an 
application via its API.
-                               Currently a statistical
+                               Currently, a statistical
                                and DictionaryLemmatizer are available. Note 
that these two methods are
                                complementary and
                                the DictionaryLemmatizer can also be used as a 
way of post-processing
@@ -153,7 +153,7 @@ shrapnel    NN      shrapnel
                ]]>
                </screen>
                                Alternatively, if a (word,postag) pair can 
output multiple lemmas, the
-                               the lemmatizer dictionary would consists of a 
text file containing, for 
+                               the lemmatizer dictionary would consist of a 
text file containing, for
                                each row, a word, its postag and the 
corresponding lemmas separated by "#":
                                <screen>
                <![CDATA[
@@ -267,7 +267,7 @@ Arguments description:
                </screen>
                                        Its now assumed that the english 
lemmatizer model should be trained
                                        from a file called
-                                       en-lemmatizer.train which is encoded as 
UTF-8. The following command will train the
+                                       'en-lemmatizer.train' which is encoded 
as UTF-8. The following command will train the
                                        lemmatizer and write the model to 
en-lemmatizer.bin:
                                        <screen>
                <![CDATA[
diff --git a/opennlp-docs/src/docbkx/machine-learning.xml b/opennlp-docs/src/docbkx/machine-learning.xml
index 2df092e2..80db8c69 100644
--- a/opennlp-docs/src/docbkx/machine-learning.xml
+++ b/opennlp-docs/src/docbkx/machine-learning.xml
@@ -31,7 +31,7 @@ under the License.
                Maximum entropy modeling is a framework for integrating 
information from many heterogeneous
                information sources for classification.  The data for a  
classification problem is described
                as a (potentially large) number of features.  These features 
can be quite complex and allow
-               the experimenter to make use of prior knowledge about what 
types of informations are expected
+               the experimenter to make use of prior knowledge about what 
types of information are expected
                to be important for classification. Each feature corresponds to 
a constraint on the model.
                We then compute the maximum entropy model, the model with the 
maximum entropy of all the models
                that satisfy the constraints.  This term may seem perverse, 
since we have spent most of the book
@@ -80,7 +80,7 @@ under the License.
                </para>
                <para>
                We have also set in place some interfaces and code to make it 
easier to automate the training
-               and evaluation process (the Evalable interface and the 
TrainEval class).  It is not necessary
+               and evaluation process (the Evaluable interface and the 
TrainEval class).  It is not necessary
                to use this functionality, but if you do you'll find it much 
easier to see how well your models
                are doing.  The opennlp.grok.preprocess.namefind package is an 
example of a maximum entropy
                component which uses this functionality.
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.xml b/opennlp-docs/src/docbkx/morfologik-addon.xml
index 6f188448..27c2dcd0 100644
--- a/opennlp-docs/src/docbkx/morfologik-addon.xml
+++ b/opennlp-docs/src/docbkx/morfologik-addon.xml
@@ -30,27 +30,27 @@
                        <itemizedlist mark='opencircle'>
                                <listitem>
                                        <para>
-                                       The 
<code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code>, 
which helps creating a POSTagger model with an embedded FSA TagDictionary.
+                                       The 
<code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code>, 
which helps create a POSTagger model with an embedded FSA TagDictionary.
                                        </para>
                                </listitem>
                                <listitem>
                                        <para>
-                                       The 
<code>MorfologikTagDictionary</code> implements a FSA based 
<code>TagDictionary</code>, allowing for much smaller files than the default 
XML based with improved memory consumption.
+                                       The 
<code>MorfologikTagDictionary</code> implements an FSA based 
<code>TagDictionary</code>, allowing for much smaller files than the default 
XML based with improved memory consumption.
                                        </para>
                                </listitem>
                                <listitem>
                                        <para>
-                                       The <code>MorfologikLemmatizer</code> 
implements a FSA based <code>Lemmatizer</code> dictionaries.
+                                       The <code>MorfologikLemmatizer</code> 
implements an FSA based <code>Lemmatizer</code> dictionaries.
                                        </para>
                                </listitem>
                        </itemizedlist>
                </para>
                <para>
-               The first two implementations can be used directly from command 
line, as in the example bellow. Having a FSA Morfologik dictionary (see next 
section how to build one), you can train a POS Tagger
+               The first two implementations can be used directly from command 
line, as in the example bellow. Having an FSA Morfologik dictionary (see next 
section how to build one), you can train a POS Tagger
                model with an embedded FSA dictionary. 
                </para>
                <para>
-               The example trains a POSTagger with a CONLL corpus named 
<code>portuguese_bosque_train.conll</code> and a FSA dictionary named 
+               The example trains a POSTagger with a CONLL corpus named 
<code>portuguese_bosque_train.conll</code> and an FSA dictionary named
                <code>pt-morfologik.dict</code>. It will output a model named 
<code>pos-pt_fsadic.model</code>.
                
                <screen>
@@ -109,7 +109,7 @@ fsa.dict.encoder=prefix
                
                                <programlisting language="java">
                <![CDATA[
-// Part 1: compile a FSA lemma dictionary 
+// Part 1: compile an FSA lemma dictionary
    
 // we need the tabular dictionary. It is mandatory to have info 
 //  file with same name, but .info extension
diff --git a/opennlp-docs/src/docbkx/namefinder.xml b/opennlp-docs/src/docbkx/namefinder.xml
index a2dfd0fc..92e5c6bf 100644
--- a/opennlp-docs/src/docbkx/namefinder.xml
+++ b/opennlp-docs/src/docbkx/namefinder.xml
@@ -76,7 +76,7 @@ Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publis
                <para>
                        To use the Name Finder in a production system it is 
strongly recommended to embed it
                        directly into the application instead of using the 
command line interface.
-                       First the name finder model must be loaded into memory 
from disk or an other source.
+                       First the name finder model must be loaded into memory 
from disk or another source.
                        In the sample below it is loaded from disk.
                        <programlisting language="java">
                                <![CDATA[
@@ -143,7 +143,7 @@ String sentence[] = new String[]{
 Span nameSpans[] = nameFinder.find(sentence);]]>
                        </programlisting>
                        The nameSpans arrays contains now exactly one Span 
which marks the name Pierre Vinken. 
-                       The elements between the begin and end offsets are the 
name tokens. In this case the begin 
+                       The elements between the start and end offsets are the 
name tokens. In this case the start
                        offset is 0 and the end offset is 2. The Span object 
also knows the type of the entity.
                        In this case it is person (defined by the model). It 
can be retrieved with a call to Span.getType().
                        Additionally to the statistical Name Finder, OpenNLP 
also offers a dictionary and a regular
@@ -240,13 +240,13 @@ Arguments description:
                 encoding for reading and writing text, if absent the system 
default is used.]]>
                         </screen>
                         It is now assumed that the english person name finder 
model should be trained from a file
-                        called en-ner-person.train which is encoded as UTF-8. 
The following command will train
+                        called 'en-ner-person.train' which is encoded as 
UTF-8. The following command will train
                         the name finder and write the model to 
en-ner-person.bin:
                         <screen>
                                <![CDATA[
 $ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data 
en-ner-person.train -encoding UTF-8]]>
                         </screen>
-The example above will train models with a pre-defined feature set. It is also 
possible to use the -resources parameter to generate features based on external 
knowledge such as those based on word representation (clustering) features. The 
external resources must all be placed in a resource directory which is then 
passed as a parameter. If this option is used it is then required to pass, via 
the -featuregen parameter, a XML custom feature generator which includes some 
of the clustering fe [...]
+The example above will train models with a pre-defined feature set. It is also 
possible to use the -resources parameter to generate features based on external 
knowledge such as those based on word representation (clustering) features. The 
external resources must all be placed in a resource directory which is then 
passed as a parameter. If this option is used it is then required to pass, via 
the -featuregen parameter, an XML custom feature generator which includes some 
clustering features [...]
                        <itemizedlist>
                                <listitem>
                                        <para>Space separated two column file 
specifying the token and the cluster class as generated by toolkits such as 
<ulink url="https://code.google.com/p/word2vec/">word2vec</ulink>.</para>
@@ -309,7 +309,7 @@ try (ObjectStream modelOut = new BufferedOutputStream(new 
FileOutputStream(model
                        <para>
                                OpenNLP defines a default feature generation 
which is used when no custom feature
                                generation is specified. Users which want to 
experiment with the feature generation
-                               can provide a custom feature generator. Either 
via API or via an xml descriptor file.
+                               can provide a custom feature generator. Either 
via an API or via a xml descriptor file.
                        </para>
                        <section id="tools.namefind.training.featuregen.api">
                        <title>Feature Generation defined by API</title>
@@ -476,7 +476,7 @@ new NameFinderME(model);]]>
                              </row>
                              <row>
                                        
<entry>WindowFeatureGeneratorFactory</entry>
-                                       <entry><emphasis>prevLength</emphasis> 
and <emphasis>nextLength</emphasis> must be integers ans specify the window 
size</entry>
+                                       <entry><emphasis>prevLength</emphasis> 
and <emphasis>nextLength</emphasis> must be integers and specify the window 
size</entry>
                              </row>
                            </tbody>
                          </tgroup>
@@ -551,7 +551,7 @@ System.out.println(result.toString());]]>
                                <itemizedlist>
                                <listitem>
                                        <para>
-                                               <ulink  
url="http://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html">
+                                               <ulink  
url="https://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html">
                                                        MUC6
                                                </ulink>
                                        </para>
diff --git a/opennlp-docs/src/docbkx/parser.xml 
b/opennlp-docs/src/docbkx/parser.xml
index 7f947cae..64cac083 100644
--- a/opennlp-docs/src/docbkx/parser.xml
+++ b/opennlp-docs/src/docbkx/parser.xml
@@ -43,7 +43,7 @@ under the License.
                <para>
                The easiest way to try out the Parser is the command line tool.
                The tool is only intended for demonstration and testing.
-               Download the English chunking parser model from the our website 
and start the Parse
+               Download the English chunking parser model from the website and 
start the Parse
                Tool with the following command.
                                <screen>
                                <![CDATA[
@@ -140,7 +140,7 @@ Parse topParses[] = ParserTool.parseLine(sentence, parser, 
1);]]>
                Penn Treebank annotation guidelines can be found on the
             <ulink 
url="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn
 Treebank home page</ulink>.
                A parser model also contains a pos tagger model, depending on 
the amount of available
-               training data it is recommend to switch the tagger model 
against a tagger model which
+               training data it is recommended to switch the tagger model 
against a tagger model which
                was trained on a larger corpus. The pre-trained parser model 
provided on the website
                is doing this to achieve a better performance. (TODO: On which 
data is the model on
                the website trained, and say on which data the tagger model is 
trained)
@@ -322,7 +322,7 @@ Usage: opennlp ParserEvaluator[.ontonotes|frenchtreebank] 
[-misclassified true|f
                -data sampleData [-encoding charsetName]]]>
                </screen>
                                A sample of the command considering you have a 
data sample named
-                               en-parser-chunking.eval
+                               en-parser-chunking.eval,
                                and you trained a model called 
en-parser-chunking.bin:
                                <screen>
                                <![CDATA[
diff --git a/opennlp-docs/src/docbkx/postagger.xml 
b/opennlp-docs/src/docbkx/postagger.xml
index ad98178c..69eacc60 100644
--- a/opennlp-docs/src/docbkx/postagger.xml
+++ b/opennlp-docs/src/docbkx/postagger.xml
@@ -134,7 +134,7 @@ That_DT sounds_VBZ good_JJ ._.]]>
                        training material it is suggested to use an empty line.
                </para>
                <para>The Part-of-Speech Tagger can either be trained with a 
command line tool,
-               or via an training API.
+               or via a training API.
                </para>
                
                <section id="tools.postagger.training.tool">
@@ -195,7 +195,7 @@ $ opennlp POSTaggerTrainer -type maxent -model 
en-pos-maxent.bin \
                                <para>The application must open a sample data 
stream</para>
                        </listitem>
                        <listitem>
-                               <para>Call the POSTagger.train method</para>
+                               <para>Call the 'POSTagger.train' method</para>
                        </listitem>
                        <listitem>
                                <para>Save the POSModel to a file</para>
@@ -232,10 +232,10 @@ try (OutputStream modelOut = new BufferedOutputStream(new 
FileOutputStream(model
                <para>
                The tag dictionary is a word dictionary which specifies which 
tags a specific token can have. Using a tag
                dictionary has two advantages, inappropriate tags can not been 
assigned to tokens in the dictionary and the
-               beam search algorithm has to consider less possibilities and 
can search faster.
+               beam search algorithm has to consider fewer possibilities and 
can search faster.
                </para>
                <para>
-               The dictionary is defined in an xml format and can be created 
and stored with the POSDictionary class.
+               The dictionary is defined in a xml format and can be created 
and stored with the POSDictionary class.
                Please for now checkout the javadoc and source code of that 
class.
                </para>
                <para>Note: The format should be documented and sample code 
should show how to use the dictionary.
diff --git a/opennlp-docs/src/docbkx/sentdetect.xml 
b/opennlp-docs/src/docbkx/sentdetect.xml
index 2f2fd1e8..ee7868eb 100644
--- a/opennlp-docs/src/docbkx/sentdetect.xml
+++ b/opennlp-docs/src/docbkx/sentdetect.xml
@@ -32,7 +32,7 @@ under the License.
                marks the end of a sentence or not. In this sense a sentence is 
defined 
                as the longest white space trimmed character sequence between 
two punctuation
                marks. The first and last sentence make an exception to this 
rule. The first 
-               non whitespace character is assumed to be the begin of a 
sentence, and the 
+               non whitespace character is assumed to be the start of a 
sentence, and the
                last non whitespace character is assumed to be a sentence end.
                The sample text below should be segmented into its sentences.
                <screen>
@@ -50,7 +50,7 @@ Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing 
group.
 Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields 
PLC,
     was named a director of this British industrial conglomerate.]]>
                </screen>
-               Usually Sentence Detection is done before the text is tokenized 
and that's the way the pre-trained models on the web site are trained,
+               Usually Sentence Detection is done before the text is tokenized 
and that's the way the pre-trained models on the website are trained,
                but it is also possible to perform tokenization first and let 
the Sentence Detector process the already tokenized text.
                The OpenNLP Sentence Detector cannot identify sentence 
boundaries based on the contents of the sentence. A prominent example is the 
first sentence in an article where the title is mistakenly identified to be the 
first part of the first sentence.
                Most components in OpenNLP expect input which is segmented into 
sentences.
@@ -117,7 +117,7 @@ Span sentences[] = sentenceDetector.sentPosDetect("  First 
sentence. Second sent
                OpenNLP has a command line tool which is used to train the 
models available from the model
                download page on various corpora. The data must be converted to 
the OpenNLP Sentence Detector
                training format. Which is one sentence per line. An empty line 
indicates a document boundary.
-               In case the document boundary is unknown, its recommended to 
have an empty line every few ten
+               In case the document boundary is unknown, it's recommended to 
have an empty line every few ten
                sentences. Exactly like the output in the sample above.
                Usage of the tool:
                <screen>
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml 
b/opennlp-docs/src/docbkx/tokenizer.xml
index d596d756..32d4f241 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -116,7 +116,7 @@ $ opennlp TokenizerME en-token.bin < article.txt > 
article-tokenized.txt]]>
                </para>
                <para>
                        Since most text comes truly raw and doesn't have 
sentence boundaries
-                       and such, its possible to create a pipe which first 
performs sentence
+                       and such, it's possible to create a pipe which first 
performs sentence
                        boundary detection and tokenization. The following 
sample illustrates
                        that.
                        <screen>
@@ -179,7 +179,7 @@ String tokens[] = tokenizer.tokenize("An input sample 
sentence.");]]>
 "An", "input", "sample", "sentence", "."]]>
                 </programlisting>
                        The second method, tokenizePos returns an array of 
Spans, each Span
-                       contain the begin and end character offsets of the 
token in the input
+                       contain the start and end character offsets of the 
token in the input
                        String.
                        <programlisting language="java">
                        <![CDATA[
@@ -215,7 +215,7 @@ double tokenProbs[] = tokenizer.getTokenProbabilities();]]>
                                available from the model download page on 
various corpora. The data
                                can be converted to the OpenNLP Tokenizer 
training format or used directly.
                 The OpenNLP format contains one sentence per line. Tokens are 
either separated by a
-                whitespace or by a special &lt;SPLIT&gt; tag. Tokens are split 
automaticaly on whitespace
+                whitespace or by a special &lt;SPLIT&gt; tag. Tokens are split 
automatically on whitespace
                 and at least one &lt;SPLIT&gt; tag must be present in the 
training text.
                                
                                The following sample shows the sample from 
above in the correct format.
@@ -413,12 +413,12 @@ He said "This is a test".]]>
 InputStream dictIn = new FileInputStream("latin-detokenizer.xml");
 DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);]]>
                                </programlisting>
-                               After the rule dictionary is loadeed the 
DictionaryDetokenizer can be instantiated.
+                               After the rule dictionary is loaded the 
DictionaryDetokenizer can be instantiated.
                                <programlisting language="java">
                                        <![CDATA[
 Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
                                </programlisting>
-                               The detokenizer offers two detokenize 
methods,the first detokenize the input tokens into a String.
+                               The detokenizer offers two detokenize methods, 
the first detokenize the input tokens into a String.
                                <programlisting language="java">
                                        <![CDATA[
 String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
diff --git a/opennlp-docs/src/docbkx/uima-integration.xml 
b/opennlp-docs/src/docbkx/uima-integration.xml
index a8bc5279..e29162ce 100644
--- a/opennlp-docs/src/docbkx/uima-integration.xml
+++ b/opennlp-docs/src/docbkx/uima-integration.xml
@@ -21,13 +21,13 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-<chapter id="org.apche.opennlp.uima">
+<chapter id="org.apache.opennlp.uima">
 <title>UIMA Integration</title>
 <para>
        The UIMA Integration wraps the OpenNLP components in UIMA Analysis 
Engines which can 
        be used to automatically annotate text and train new OpenNLP models 
from annotated text.
 </para>
-       <section id="org.apche.opennlp.running-pear-sample">
+       <section id="org.apache.opennlp.running-pear-sample">
                <title>Running the pear sample in CVD</title>
                <para>
                        The Cas Visual Debugger is shipped as part of the UIMA 
distribution and is a tool which can run
@@ -55,27 +55,27 @@ createPear:
      [copy] Copying 1 file to OpenNlpTextAnalyzer/lib
      [copy] Copying 3 files to OpenNlpTextAnalyzer/lib
     [mkdir] Created dir: OpenNlpTextAnalyzer/models
-      [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-token.bin
+      [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-token.bin
       [get] To: OpenNlpTextAnalyzer/models/en-token.bin
-      [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-sent.bin
+      [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-sent.bin
       [get] To: OpenNlpTextAnalyzer/models/en-sent.bin
-      [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin
+      [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-date.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-date.bin
-      [get] Getting: 
http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
+      [get] Getting: 
https://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-location.bin
-      [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
+      [get] Getting: 
https://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-money.bin
-      [get] Getting: 
http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
+      [get] Getting: 
https://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-organization.bin
-      [get] Getting: 
http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin
+      [get] Getting: 
https://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-percentage.bin
-      [get] Getting: 
http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
+      [get] Getting: 
https://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-person.bin
-      [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-time.bin
+      [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-time.bin
       [get] To: OpenNlpTextAnalyzer/models/en-ner-time.bin
-      [get] Getting: 
http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
+      [get] Getting: 
https://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
       [get] To: OpenNlpTextAnalyzer/models/en-pos-maxent.bin
-      [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-chunker.bin
+      [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-chunker.bin
       [get] To: OpenNlpTextAnalyzer/models/en-chunker.bin
       [zip] Building zip: OpenNlpTextAnalyzer.pear
 
@@ -92,7 +92,7 @@ Total time: 3 minutes 20 seconds]]>
                        must be written in English.
                </para>
        </section>
-       <section id="org.apche.opennlp.further-help">
+       <section id="org.apache.opennlp.further-help">
                <title>Further Help</title>
                <para>
                        For more information about how to use the integration 
please consult the javadoc of the individual
