This is an automated email from the ASF dual-hosted git repository.
jzemerick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/opennlp.git
The following commit(s) were added to refs/heads/main by this push:
new c671d607 OPENNLP-1435 Clear typos from opennlp-docs module (#480)
c671d607 is described below
commit c671d6075b531e0bdbff272187523c2c9eade75d
Author: Martin Wiesner <[email protected]>
AuthorDate: Tue Jan 3 15:50:21 2023 +0100
OPENNLP-1435 Clear typos from opennlp-docs module (#480)
- fixes typos in several docbkx files
- switches references to http URLs, if available, to a secure form (https)
- replaces an unreachable URL with a web-archived version of the original
---
opennlp-docs/src/docbkx/chunker.xml | 2 +-
opennlp-docs/src/docbkx/cli.xml | 20 ++++++++---------
opennlp-docs/src/docbkx/corpora.xml | 32 ++++++++++++++--------------
opennlp-docs/src/docbkx/introduction.xml | 6 +++---
opennlp-docs/src/docbkx/langdetect.xml | 10 ++++-----
opennlp-docs/src/docbkx/lemmatizer.xml | 8 +++----
opennlp-docs/src/docbkx/machine-learning.xml | 4 ++--
opennlp-docs/src/docbkx/morfologik-addon.xml | 12 +++++------
opennlp-docs/src/docbkx/namefinder.xml | 14 ++++++------
opennlp-docs/src/docbkx/parser.xml | 6 +++---
opennlp-docs/src/docbkx/postagger.xml | 8 +++----
opennlp-docs/src/docbkx/sentdetect.xml | 6 +++---
opennlp-docs/src/docbkx/tokenizer.xml | 10 ++++-----
opennlp-docs/src/docbkx/uima-integration.xml | 28 ++++++++++++------------
14 files changed, 83 insertions(+), 83 deletions(-)
diff --git a/opennlp-docs/src/docbkx/chunker.xml b/opennlp-docs/src/docbkx/chunker.xml
index 262f4734..5c65deac 100644
--- a/opennlp-docs/src/docbkx/chunker.xml
+++ b/opennlp-docs/src/docbkx/chunker.xml
@@ -150,7 +150,7 @@ Sequence topSequences[] = chunk.topKSequences(sent, pos);]]>
</para>
<para>
The training data can be converted to the OpenNLP chunker
training format,
- which is based on <ulink
url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>.
+ which is based on <ulink
url="https://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>.
Other formats may also be available.
The training data consist of three columns separated one single
space. Each word has been put on a
separate line and there is an empty line after each sentence.
The first column contains
diff --git a/opennlp-docs/src/docbkx/cli.xml b/opennlp-docs/src/docbkx/cli.xml
index f809029a..adc6538d 100644
--- a/opennlp-docs/src/docbkx/cli.xml
+++ b/opennlp-docs/src/docbkx/cli.xml
@@ -32,7 +32,7 @@ under the License.
<title>The Command Line Interface</title>
-<para>This section details the available tools and parameters of the Command
Line Interface. For a introduction in its usage please refer to <xref
linkend='intro.cli'/>. </para>
+<para>This section details the available tools and parameters of the Command
Line Interface. For an introduction in its usage please refer to <xref
linkend='intro.cli'/>. </para>
<section id='tools.cli.doccat'>
@@ -92,7 +92,7 @@ Arguments description:
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
@@ -139,7 +139,7 @@ Arguments description:
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
@@ -197,7 +197,7 @@ Arguments description:
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
@@ -232,7 +232,7 @@ Usage: opennlp DoccatConverter help|leipzig [help|options...]
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
@@ -299,7 +299,7 @@ Arguments description:
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
@@ -346,7 +346,7 @@ Usage: opennlp LanguageDetectorConverter help|leipzig [help|options...]
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
@@ -410,7 +410,7 @@ Arguments description:
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
@@ -469,7 +469,7 @@ Arguments description:
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
-<entry>Dir with Leipig sentences to be used</entry>
+<entry>Dir with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
@@ -3919,7 +3919,7 @@ Usage: opennlp ChunkerConverter help|ad [help|options...]
Usage: opennlp Parser [-bs n -ap n -k n -tk tok_model] model < sentences
-bs n: Use a beam size of n.
-ap f: Advance outcomes in with at least f% of the probability mass.
--k n: Show the top n parses. This will also display their log-probablities.
+-k n: Show the top n parses. This will also display their log-probabilities.
-tk tok_model: Use the specified tokenizer model to tokenize the sentences.
Defaults to a WhitespaceTokenizer.
]]>
diff --git a/opennlp-docs/src/docbkx/corpora.xml b/opennlp-docs/src/docbkx/corpora.xml
index 187c9c31..b21f61a6 100644
--- a/opennlp-docs/src/docbkx/corpora.xml
+++ b/opennlp-docs/src/docbkx/corpora.xml
@@ -144,13 +144,13 @@ F-Measure: 0.9230575441395671]]>
<title>Getting the data</title>
<para>The data consists of three files per language: one
training file and two test files testa and testb.
The first test file will be used in the development phase for
finding good parameters for the learning system.
- The second test file will be used for the final evaluation.
Currently there are data files available for two languages:
+ The second test file will be used for the final evaluation.
Currently, there are data files available for two languages:
Spanish and Dutch.
</para>
<para>
The Spanish data is a collection of news wire articles made
available by the Spanish EFE News Agency. The articles are
- from May 2000. The annotation was carried out by the <ulink
url="http://www.talp.cat/">TALP Research Center</ulink> of the Technical
University of Catalonia (UPC)
- and the <ulink url="http://clic.ub.edu/">Center of Language and
Computation (CLiC)</ulink>of the University of Barcelona (UB), and funded by
the European Commission
+ from May 2000. The annotation was carried out by the <ulink
url="https://www.talp.cat/">TALP Research Center</ulink> of the Technical
University of Catalonia (UPC)
+ and the <ulink
url="https://web.archive.org/web/20220516042208/http://clic.ub.edu/">Center of
Language and Computation (CLiC)</ulink>of the University of Barcelona (UB), and
funded by the European Commission
through the NAMIC project (IST-1999-12392).
</para>
<para>
@@ -159,12 +159,12 @@ F-Measure: 0.9230575441395671]]>
</para>
<para>
You can find the Spanish files here:
- <ulink
url="http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>
+ <ulink
url="https://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html">https://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</ulink>
You must download esp.train.gz, unzip it and you will see the
file esp.train.
</para>
<para>
You can find the Dutch files here:
- <ulink
url="http://www.cnts.ua.ac.be/conll2002/ner.tgz">http://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>
+ <ulink
url="https://www.cnts.ua.ac.be/conll2002/ner.tgz">https://www.cnts.ua.ac.be/conll2002/ner.tgz</ulink>
You must unzip it and go to /ner/data/ned.train.gz, so you
unzip it too, and you will see the file ned.train.
</para>
</section>
@@ -260,7 +260,7 @@ path: .\es_ner_person.bin]]>
<para>
The English data is the Reuters Corpus, which is a collection
of news wire articles.
The Reuters Corpus can be obtained free of charges from the
NIST for research
- purposes: <ulink
url="http://trec.nist.gov/data/reuters/reuters.html">http://trec.nist.gov/data/reuters/reuters.html</ulink>
+ purposes: <ulink
url="https://trec.nist.gov/data/reuters/reuters.html">https://trec.nist.gov/data/reuters/reuters.html</ulink>
</para>
<para>
The German data is a collection of articles from the German
newspaper Frankfurter
@@ -387,16 +387,16 @@ F-Measure: 0.8267557582133971]]>
<section id="tools.corpora.arvores-deitadas">
<title>Arvores Deitadas</title>
<para>
- The Portuguese corpora available at <ulink
url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the
Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from AD
format to native format.
+ The Portuguese corpora available at <ulink
url="https://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow
the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from
AD format to native format.
</para>
<section id="tools.corpora.arvores-deitadas.getting">
<title>Getting the data</title>
<para>
- The Corpus can be downloaded from here: <ulink
url="http://www.linguateca.pt/floresta/corpus.html">http://www.linguateca.pt/floresta/corpus.html</ulink>
+ The Corpus can be downloaded from here: <ulink
url="https://www.linguateca.pt/floresta/corpus.html">https://www.linguateca.pt/floresta/corpus.html</ulink>
</para>
<para>
- The Name Finder models were trained using the Amazonia
corpus: <ulink
url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
- The Chunker models were trained using the <ulink
url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
+ The Name Finder models were trained using the Amazonia
corpus: <ulink
url="https://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
+ The Chunker models were trained using the <ulink
url="https://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
</para>
</section>
@@ -474,15 +474,15 @@ F-Measure: 0.7717879983140168]]>
Penn Treebank for syntax and the Penn PropBank for
predicate-argument
structure. Its semantic representation will include word sense
disambiguation for nouns and verbs, with each word sense
connected to
- an ontology, and coreference. The current goals call for
annotation of
+ an ontology, and co-reference. The current goals call for
annotation of
over a million words each of English and Chinese, and half a
million
- words of Arabic over five years."
(http://catalog.ldc.upenn.edu/LDC2011T03)
+ words of Arabic over five years."
(https://catalog.ldc.upenn.edu/LDC2011T03)
</para>
<section id="tools.corpora.ontonotes.namefinder">
<title>Name Finder Training</title>
<para>
The OntoNotes corpus can be used to train the Name Finder. The
corpus
- contains many different name types
+ contains different name types
to train a model for a specific type only the built-in type
filter
option should be used.
</para>
@@ -535,11 +535,11 @@ path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]>
OpenNLP can directly be trained and evaluated on
labeled data in the brat format.
Instructions on how to use, download and install brat
can be found on the project website:
- <ulink
url="http://brat.nlplab.org">http://brat.nlplab.org</ulink>
+ <ulink
url="https://brat.nlplab.org">https://brat.nlplab.org</ulink>
Configuration of brat, including setting up the
different entities and relations can be found at:
- <ulink
url="http://brat.nlplab.org/configuration.html">http://brat.nlplab.org/configuration.html</ulink>
+ <ulink
url="https://brat.nlplab.org/configuration.html">https://brat.nlplab.org/configuration.html</ulink>
</para>
@@ -548,7 +548,7 @@ path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin]]>
<title>Sentences and Tokens</title>
<para>
The brat annotation tool only adds named entity
spans to the data and doesn't provide information
- about tokens and sentences. To train the name
finder this information is required. By default it
+ about tokens and sentences. To train the name
finder this information is required. By default, it
is assumed that each line is a sentence and
that tokens are whitespace separated. This can be
adjusted by providing a custom sentence
detector and optional also a tokenizer.
diff --git a/opennlp-docs/src/docbkx/introduction.xml b/opennlp-docs/src/docbkx/introduction.xml
index 484e5b08..16acbbeb 100644
--- a/opennlp-docs/src/docbkx/introduction.xml
+++ b/opennlp-docs/src/docbkx/introduction.xml
@@ -34,7 +34,7 @@ under the License.
</para>
<para>
- The goal of the OpenNLP project will be to create a mature toolkit for
the abovementioned tasks.
+ The goal of the OpenNLP project will be to create a mature toolkit for
the aforementioned tasks.
An additional goal is to provide a large number of pre-built models
for a variety of languages, as
well as the annotated text resources that those models are derived
from.
</para>
@@ -306,7 +306,7 @@ $ opennlp ToolNameEvaluator -model en-model-name.bin -lang
en -data input.test -
documentation we will refer to these models as "OpenNLP
models." All NLP
components of OpenNLP support this type of model. The sections
below in
this documentation describe how to train and use these models.
<ulink url="https://opennlp.apache.org/models.html">Pre-trained
- models</ulink> are available for some languages and some of
the OpenNLP components.
+ models</ulink> are available for some languages and some
OpenNLP components.
</para>
</section>
<section id="intro.models.onnx">
@@ -318,7 +318,7 @@ $ opennlp ToolNameEvaluator -model en-model-name.bin -lang
en -data input.test -
each of the OpenNLP components that supports ONNX models
describes how to
use ONNX models for inference. Note that OpenNLP does not
support training
models that can be used by the ONNX Runtime - ONNX models must
be created
- outside of OpenNLP using other tools.
+ outside OpenNLP using other tools.
</para>
</section>
</section>
diff --git a/opennlp-docs/src/docbkx/langdetect.xml b/opennlp-docs/src/docbkx/langdetect.xml
index aef1fd41..3025d5e5 100644
--- a/opennlp-docs/src/docbkx/langdetect.xml
+++ b/opennlp-docs/src/docbkx/langdetect.xml
@@ -27,7 +27,7 @@ under the License.
<title>Classifying</title>
<para>
The OpenNLP Language Detector classifies a document in
ISO-639-3 languages according to the model capabilities.
- A model can be trained with Maxent, Perceptron or Naive Bayes
algorithms. By default normalizes a text and
+ A model can be trained with Maxent, Perceptron or Naive Bayes
algorithms. By default, normalizes a text and
the context generator extracts n-grams of size 1, 2 and
3. The n-gram sizes, the normalization and the
context generator can be customized by extending the
LanguageDetectorFactory.
@@ -57,7 +57,7 @@ under the License.
</row>
<row>
<entry>TwitterCharSequenceNormalizer</entry>
- <entry>Replaces
hashtags and Twitter user names by blank spaces.</entry>
+ <entry>Replaces
hashtags and Twitter usernames by blank spaces.</entry>
</row>
<row>
<entry>NumberCharSequenceNormalizer</entry>
@@ -160,8 +160,8 @@ $ bin/opennlp LanguageDetectorTrainer[.leipzig] -model
modelFile [-params params
<section id="tools.langdetect.training.leipzig">
<title>Training with Leipzig</title>
<para>
- The Leipzig Corpora collection presents corpora
in different languages. The corpora is a collection
- of individual sentences collected from the web
and newspapers. The Corpora is available as plain text
+ The Leipzig Corpora collection presents corpora
in different languages. The corpora are a collection
+ of individual sentences collected from the web
and newspapers. The Corpora are available as plain text
and as MySQL database tables. The OpenNLP
integration can only use the plain text version.
The individual plain text packages can be
downloaded here:
<ulink
url="https://wortschatz.uni-leipzig.de/en/download">https://wortschatz.uni-leipzig.de/en/download</ulink>
@@ -184,7 +184,7 @@ $ bin/opennlp LanguageDetectorTrainer.leipzig -model
modelFile [-params paramsFi
<para>
The following sequence of commands shows how to
convert the Leipzig Corpora collection at folder
leipzig-train/ to the default Language Detector
format, by creating groups of 5 sentences as documents
- and limiting to 10000 documents per language.
Them, it shuffles the result and select the first
+ and limiting to 10000 documents per language.
Then, it shuffles the result and select the first
100000 lines as train corpus and the last 20000
as evaluation corpus:
<screen>
<![CDATA[
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml
index 8be62423..44356e04 100644
--- a/opennlp-docs/src/docbkx/lemmatizer.xml
+++ b/opennlp-docs/src/docbkx/lemmatizer.xml
@@ -25,7 +25,7 @@
postag of the word is required to find the lemma. For
example, the form
`show' may refer
to either the verb "to show" or to the noun "show".
- Currently OpenNLP implement statistical and
dictionary-based lemmatizers.
+ Currently, OpenNLP implement statistical and
dictionary-based lemmatizers.
</para>
<section id="tools.lemmatizer.tagging.cmdline">
<title>Lemmatizer Tool</title>
@@ -75,7 +75,7 @@ signed VBD sign
<title>Lemmatizer API</title>
<para>
The Lemmatizer can be embedded into an
application via its API.
- Currently a statistical
+ Currently, a statistical
and DictionaryLemmatizer are available. Note
that these two methods are
complementary and
the DictionaryLemmatizer can also be used as a
way of post-processing
@@ -153,7 +153,7 @@ shrapnel NN shrapnel
]]>
</screen>
Alternatively, if a (word,postag) pair can
output multiple lemmas, the
- the lemmatizer dictionary would consists of a
text file containing, for
+ the lemmatizer dictionary would consist of a
text file containing, for
each row, a word, its postag and the
corresponding lemmas separated by "#":
<screen>
<![CDATA[
@@ -267,7 +267,7 @@ Arguments description:
</screen>
Its now assumed that the english
lemmatizer model should be trained
from a file called
- en-lemmatizer.train which is encoded as
UTF-8. The following command will train the
+ 'en-lemmatizer.train' which is encoded
as UTF-8. The following command will train the
lemmatizer and write the model to
en-lemmatizer.bin:
<screen>
<![CDATA[
diff --git a/opennlp-docs/src/docbkx/machine-learning.xml b/opennlp-docs/src/docbkx/machine-learning.xml
index 2df092e2..80db8c69 100644
--- a/opennlp-docs/src/docbkx/machine-learning.xml
+++ b/opennlp-docs/src/docbkx/machine-learning.xml
@@ -31,7 +31,7 @@ under the License.
Maximum entropy modeling is a framework for integrating
information from many heterogeneous
information sources for classification. The data for a
classification problem is described
as a (potentially large) number of features. These features
can be quite complex and allow
- the experimenter to make use of prior knowledge about what
types of informations are expected
+ the experimenter to make use of prior knowledge about what
types of information are expected
to be important for classification. Each feature corresponds to
a constraint on the model.
We then compute the maximum entropy model, the model with the
maximum entropy of all the models
that satisfy the constraints. This term may seem perverse,
since we have spent most of the book
@@ -80,7 +80,7 @@ under the License.
</para>
<para>
We have also set in place some interfaces and code to make it
easier to automate the training
- and evaluation process (the Evalable interface and the
TrainEval class). It is not necessary
+ and evaluation process (the Evaluable interface and the
TrainEval class). It is not necessary
to use this functionality, but if you do you'll find it much
easier to see how well your models
are doing. The opennlp.grok.preprocess.namefind package is an
example of a maximum entropy
component which uses this functionality.
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.xml b/opennlp-docs/src/docbkx/morfologik-addon.xml
index 6f188448..27c2dcd0 100644
--- a/opennlp-docs/src/docbkx/morfologik-addon.xml
+++ b/opennlp-docs/src/docbkx/morfologik-addon.xml
@@ -30,27 +30,27 @@
<itemizedlist mark='opencircle'>
<listitem>
<para>
- The
<code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code>,
which helps creating a POSTagger model with an embedded FSA TagDictionary.
+ The
<code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code>,
which helps create a POSTagger model with an embedded FSA TagDictionary.
</para>
</listitem>
<listitem>
<para>
- The
<code>MorfologikTagDictionary</code> implements a FSA based
<code>TagDictionary</code>, allowing for much smaller files than the default
XML based with improved memory consumption.
+ The
<code>MorfologikTagDictionary</code> implements an FSA based
<code>TagDictionary</code>, allowing for much smaller files than the default
XML based with improved memory consumption.
</para>
</listitem>
<listitem>
<para>
- The <code>MorfologikLemmatizer</code>
implements a FSA based <code>Lemmatizer</code> dictionaries.
+ The <code>MorfologikLemmatizer</code>
implements an FSA based <code>Lemmatizer</code> dictionaries.
</para>
</listitem>
</itemizedlist>
</para>
<para>
- The first two implementations can be used directly from command
line, as in the example bellow. Having a FSA Morfologik dictionary (see next
section how to build one), you can train a POS Tagger
+ The first two implementations can be used directly from command
line, as in the example bellow. Having an FSA Morfologik dictionary (see next
section how to build one), you can train a POS Tagger
model with an embedded FSA dictionary.
</para>
<para>
- The example trains a POSTagger with a CONLL corpus named
<code>portuguese_bosque_train.conll</code> and a FSA dictionary named
+ The example trains a POSTagger with a CONLL corpus named
<code>portuguese_bosque_train.conll</code> and an FSA dictionary named
<code>pt-morfologik.dict</code>. It will output a model named
<code>pos-pt_fsadic.model</code>.
<screen>
@@ -109,7 +109,7 @@ fsa.dict.encoder=prefix
<programlisting language="java">
<![CDATA[
-// Part 1: compile a FSA lemma dictionary
+// Part 1: compile an FSA lemma dictionary
// we need the tabular dictionary. It is mandatory to have info
// file with same name, but .info extension
diff --git a/opennlp-docs/src/docbkx/namefinder.xml b/opennlp-docs/src/docbkx/namefinder.xml
index a2dfd0fc..92e5c6bf 100644
--- a/opennlp-docs/src/docbkx/namefinder.xml
+++ b/opennlp-docs/src/docbkx/namefinder.xml
@@ -76,7 +76,7 @@ Mr . <START:person> Vinken <END> is chairman of Elsevier N.V.
, the Dutch publis
<para>
To use the Name Finder in a production system it is
strongly recommended to embed it
directly into the application instead of using the
command line interface.
- First the name finder model must be loaded into memory
from disk or an other source.
+ First the name finder model must be loaded into memory
from disk or another source.
In the sample below it is loaded from disk.
<programlisting language="java">
<![CDATA[
@@ -143,7 +143,7 @@ String sentence[] = new String[]{
Span nameSpans[] = nameFinder.find(sentence);]]>
</programlisting>
The nameSpans arrays contains now exactly one Span
which marks the name Pierre Vinken.
- The elements between the begin and end offsets are the
name tokens. In this case the begin
+ The elements between the start and end offsets are the
name tokens. In this case the start
offset is 0 and the end offset is 2. The Span object
also knows the type of the entity.
In this case it is person (defined by the model). It
can be retrieved with a call to Span.getType().
Additionally to the statistical Name Finder, OpenNLP
also offers a dictionary and a regular
@@ -240,13 +240,13 @@ Arguments description:
encoding for reading and writing text, if absent the system
default is used.]]>
</screen>
It is now assumed that the english person name finder
model should be trained from a file
- called en-ner-person.train which is encoded as UTF-8.
The following command will train
+ called 'en-ner-person.train' which is encoded as
UTF-8. The following command will train
the name finder and write the model to
en-ner-person.bin:
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data
en-ner-person.train -encoding UTF-8]]>
</screen>
-The example above will train models with a pre-defined feature set. It is also
possible to use the -resources parameter to generate features based on external
knowledge such as those based on word representation (clustering) features. The
external resources must all be placed in a resource directory which is then
passed as a parameter. If this option is used it is then required to pass, via
the -featuregen parameter, a XML custom feature generator which includes some
of the clustering fe [...]
+The example above will train models with a pre-defined feature set. It is also
possible to use the -resources parameter to generate features based on external
knowledge such as those based on word representation (clustering) features. The
external resources must all be placed in a resource directory which is then
passed as a parameter. If this option is used it is then required to pass, via
the -featuregen parameter, an XML custom feature generator which includes some
clustering features [...]
<itemizedlist>
<listitem>
<para>Space separated two column file
specifying the token and the cluster class as generated by toolkits such as
<ulink url="https://code.google.com/p/word2vec/">word2vec</ulink>.</para>
@@ -309,7 +309,7 @@ try (ObjectStream modelOut = new BufferedOutputStream(new
FileOutputStream(model
<para>
OpenNLP defines a default feature generation
which is used when no custom feature
generation is specified. Users which want to
experiment with the feature generation
- can provide a custom feature generator. Either
via API or via an xml descriptor file.
+ can provide a custom feature generator. Either
via an API or via a xml descriptor file.
</para>
<section id="tools.namefind.training.featuregen.api">
<title>Feature Generation defined by API</title>
@@ -476,7 +476,7 @@ new NameFinderME(model);]]>
</row>
<row>
<entry>WindowFeatureGeneratorFactory</entry>
- <entry><emphasis>prevLength</emphasis>
and <emphasis>nextLength</emphasis> must be integers ans specify the window
size</entry>
+ <entry><emphasis>prevLength</emphasis>
and <emphasis>nextLength</emphasis> must be integers and specify the window
size</entry>
</row>
</tbody>
</tgroup>
@@ -551,7 +551,7 @@ System.out.println(result.toString());]]>
<itemizedlist>
<listitem>
<para>
- <ulink
url="http://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html">
+ <ulink
url="https://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html">
MUC6
</ulink>
</para>
diff --git a/opennlp-docs/src/docbkx/parser.xml b/opennlp-docs/src/docbkx/parser.xml
index 7f947cae..64cac083 100644
--- a/opennlp-docs/src/docbkx/parser.xml
+++ b/opennlp-docs/src/docbkx/parser.xml
@@ -43,7 +43,7 @@ under the License.
<para>
The easiest way to try out the Parser is the command line tool.
The tool is only intended for demonstration and testing.
- Download the English chunking parser model from the our website
and start the Parse
+ Download the English chunking parser model from the website and
start the Parse
Tool with the following command.
<screen>
<![CDATA[
@@ -140,7 +140,7 @@ Parse topParses[] = ParserTool.parseLine(sentence, parser,
1);]]>
Penn Treebank annotation guidelines can be found on the
<ulink
url="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn
Treebank home page</ulink>.
A parser model also contains a pos tagger model, depending on
the amount of available
- training data it is recommend to switch the tagger model
against a tagger model which
+ training data it is recommended to switch the tagger model
against a tagger model which
was trained on a larger corpus. The pre-trained parser model
provided on the website
is doing this to achieve a better performance. (TODO: On which
data is the model on
the website trained, and say on which data the tagger model is
trained)
@@ -322,7 +322,7 @@ Usage: opennlp ParserEvaluator[.ontonotes|frenchtreebank]
[-misclassified true|f
-data sampleData [-encoding charsetName]]]>
</screen>
A sample of the command considering you have a
data sample named
- en-parser-chunking.eval
+ en-parser-chunking.eval,
and you trained a model called
en-parser-chunking.bin:
<screen>
<![CDATA[
diff --git a/opennlp-docs/src/docbkx/postagger.xml b/opennlp-docs/src/docbkx/postagger.xml
index ad98178c..69eacc60 100644
--- a/opennlp-docs/src/docbkx/postagger.xml
+++ b/opennlp-docs/src/docbkx/postagger.xml
@@ -134,7 +134,7 @@ That_DT sounds_VBZ good_JJ ._.]]>
training material it is suggested to use an empty line.
</para>
<para>The Part-of-Speech Tagger can either be trained with a
command line tool,
- or via an training API.
+ or via a training API.
</para>
<section id="tools.postagger.training.tool">
@@ -195,7 +195,7 @@ $ opennlp POSTaggerTrainer -type maxent -model
en-pos-maxent.bin \
<para>The application must open a sample data
stream</para>
</listitem>
<listitem>
- <para>Call the POSTagger.train method</para>
+ <para>Call the 'POSTagger.train' method</para>
</listitem>
<listitem>
<para>Save the POSModel to a file</para>
@@ -232,10 +232,10 @@ try (OutputStream modelOut = new BufferedOutputStream(new
FileOutputStream(model
<para>
The tag dictionary is a word dictionary which specifies which
tags a specific token can have. Using a tag
dictionary has two advantages, inappropriate tags can not been
assigned to tokens in the dictionary and the
- beam search algorithm has to consider less possibilities and
can search faster.
+ beam search algorithm has to consider fewer possibilities and
can search faster.
</para>
<para>
- The dictionary is defined in an xml format and can be created
and stored with the POSDictionary class.
+ The dictionary is defined in a xml format and can be created
and stored with the POSDictionary class.
Please for now checkout the javadoc and source code of that
class.
</para>
<para>Note: The format should be documented and sample code
should show how to use the dictionary.
diff --git a/opennlp-docs/src/docbkx/sentdetect.xml b/opennlp-docs/src/docbkx/sentdetect.xml
index 2f2fd1e8..ee7868eb 100644
--- a/opennlp-docs/src/docbkx/sentdetect.xml
+++ b/opennlp-docs/src/docbkx/sentdetect.xml
@@ -32,7 +32,7 @@ under the License.
marks the end of a sentence or not. In this sense a sentence is
defined
as the longest white space trimmed character sequence between
two punctuation
marks. The first and last sentence make an exception to this
rule. The first
- non whitespace character is assumed to be the begin of a
sentence, and the
+ non whitespace character is assumed to be the start of a
sentence, and the
last non whitespace character is assumed to be a sentence end.
The sample text below should be segmented into its sentences.
<screen>
@@ -50,7 +50,7 @@ Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing
group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields
PLC,
was named a director of this British industrial conglomerate.]]>
</screen>
- Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the web site are trained,
+ Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the website are trained,
but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence.
Most components in OpenNLP expect input which is segmented into sentences.
@@ -117,7 +117,7 @@ Span sentences[] = sentenceDetector.sentPosDetect(" First sentence. Second sent
OpenNLP has a command line tool which is used to train the models available from the model
download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
training format. Which is one sentence per line. An empty line indicates a document boundary.
- In case the document boundary is unknown, its recommended to have an empty line every few ten
+ In case the document boundary is unknown, it's recommended to have an empty line every few ten
sentences. Exactly like the output in the sample above.
Usage of the tool:
<screen>
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml
index d596d756..32d4f241 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -116,7 +116,7 @@ $ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt]]>
</para>
<para>
Since most text comes truly raw and doesn't have sentence boundaries
- and such, its possible to create a pipe which first performs sentence
+ and such, it's possible to create a pipe which first performs sentence
boundary detection and tokenization. The following sample illustrates
that.
<screen>
@@ -179,7 +179,7 @@ String tokens[] = tokenizer.tokenize("An input sample sentence.");]]>
"An", "input", "sample", "sentence", "."]]>
</programlisting>
The second method, tokenizePos returns an array of Spans, each Span
- contain the begin and end character offsets of the token in the input
+ contain the start and end character offsets of the token in the input
String.
<programlisting language="java">
<![CDATA[
@@ -215,7 +215,7 @@ double tokenProbs[] = tokenizer.getTokenProbabilities();]]>
available from the model download page on various corpora. The data
can be converted to the OpenNLP Tokenizer training format or used directly.
The OpenNLP format contains one sentence per line. Tokens are either separated by a
- whitespace or by a special <SPLIT> tag. Tokens are split automaticaly on whitespace
+ whitespace or by a special <SPLIT> tag. Tokens are split automatically on whitespace
and at least one <SPLIT> tag must be present in the training text.
The following sample shows the sample from above in the correct format.
@@ -413,12 +413,12 @@ He said "This is a test".]]>
InputStream dictIn = new FileInputStream("latin-detokenizer.xml");
DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);]]>
</programlisting>
- After the rule dictionary is loadeed the DictionaryDetokenizer can be instantiated.
+ After the rule dictionary is loaded the DictionaryDetokenizer can be instantiated.
<programlisting language="java">
<![CDATA[
Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
</programlisting>
- The detokenizer offers two detokenize methods,the first detokenize the input tokens into a String.
+ The detokenizer offers two detokenize methods, the first detokenize the input tokens into a String.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
diff --git a/opennlp-docs/src/docbkx/uima-integration.xml b/opennlp-docs/src/docbkx/uima-integration.xml
index a8bc5279..e29162ce 100644
--- a/opennlp-docs/src/docbkx/uima-integration.xml
+++ b/opennlp-docs/src/docbkx/uima-integration.xml
@@ -21,13 +21,13 @@ specific language governing permissions and limitations
under the License.
-->
-<chapter id="org.apche.opennlp.uima">
+<chapter id="org.apache.opennlp.uima">
<title>UIMA Integration</title>
<para>
The UIMA Integration wraps the OpenNLP components in UIMA Analysis Engines which can
be used to automatically annotate text and train new OpenNLP models from annotated text.
</para>
- <section id="org.apche.opennlp.running-pear-sample">
+ <section id="org.apache.opennlp.running-pear-sample">
<title>Running the pear sample in CVD</title>
<para>
The Cas Visual Debugger is shipped as part of the UIMA distribution and is a tool which can run
@@ -55,27 +55,27 @@ createPear:
[copy] Copying 1 file to OpenNlpTextAnalyzer/lib
[copy] Copying 3 files to OpenNlpTextAnalyzer/lib
[mkdir] Created dir: OpenNlpTextAnalyzer/models
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-token.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-token.bin
[get] To: OpenNlpTextAnalyzer/models/en-token.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-sent.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-sent.bin
[get] To: OpenNlpTextAnalyzer/models/en-sent.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-date.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-date.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-location.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-money.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-organization.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-percentage.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-person.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-time.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-ner-time.bin
[get] To: OpenNlpTextAnalyzer/models/en-ner-time.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin
[get] To: OpenNlpTextAnalyzer/models/en-pos-maxent.bin
- [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-chunker.bin
+ [get] Getting: https://opennlp.sourceforge.net/models-1.5/en-chunker.bin
[get] To: OpenNlpTextAnalyzer/models/en-chunker.bin
[zip] Building zip: OpenNlpTextAnalyzer.pear
@@ -92,7 +92,7 @@ Total time: 3 minutes 20 seconds]]>
must be written in English.
</para>
</section>
- <section id="org.apche.opennlp.further-help">
+ <section id="org.apache.opennlp.further-help">
<title>Further Help</title>
<para>
For more information about how to use the integration please consult the javadoc of the individual