http://git-wip-us.apache.org/repos/asf/opennlp-site/blob/95530810/src/main/docs/1.7.2/manual/opennlp.html
----------------------------------------------------------------------
diff --git a/src/main/docs/1.7.2/manual/opennlp.html 
b/src/main/docs/1.7.2/manual/opennlp.html
new file mode 100644
index 0000000..84dc967
--- /dev/null
+++ b/src/main/docs/1.7.2/manual/opennlp.html
@@ -0,0 +1,5388 @@
+<html><head>
+      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
+   <title>Apache OpenNLP Developer Documentation</title><link rel="stylesheet" 
href="css/opennlp-docs.css" type="text/css"><meta name="generator" 
content="DocBook XSL-NS Stylesheets V1.75.2"></head><body bgcolor="white" 
text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div lang="en" 
class="book" title="Apache OpenNLP Developer Documentation"><div 
class="titlepage"><div><div><h1 class="title"><a name="d4e1"></a>Apache OpenNLP 
Developer Documentation</h1></div><div><div class="authorgroup">
+                       <h3 class="corpauthor">Written and maintained by the 
Apache OpenNLP Development
+                               Community</h3>
+               </div></div><div><p class="releaseinfo">
+                       Version 1.7.2
+               </p></div><div><p class="copyright">Copyright &copy; 2011, 2017 
The Apache Software Foundation</p></div><div><div class="legalnotice" 
title="Legal Notice"><a name="d4e7"></a>
+                       <p title="License and Disclaimer">
+                               <b>License and Disclaimer.&nbsp;</b>
+                               
+                                       The ASF licenses this documentation
+                                       to you under the Apache License,
+                                       Version 2.0 (the
+                                       "License"); you may not use this 
documentation
+                                       except in compliance
+                                       with the License. You may obtain a copy 
of the
+                                       License at
+
+                                       </p><div class="blockquote"><blockquote 
class="blockquote">
+                                               <p>
+                                                       <a class="ulink" 
href="http://www.apache.org/licenses/LICENSE-2.0" 
target="_top">http://www.apache.org/licenses/LICENSE-2.0</a>
+                                               </p>
+                                       </blockquote></div><p title="License 
and Disclaimer">
+
+                                       Unless required by applicable law or 
agreed to in writing,
+                                       this documentation and its contents are 
distributed under the License
+                                       on an
+                                       "AS IS" BASIS, WITHOUT WARRANTIES OR 
CONDITIONS OF ANY
+                                       KIND, either express or implied. See 
the License for the
+                                       specific language governing permissions 
and limitations
+                                       under the License.
+                               
+                       </p>
+               </div></div></div><hr></div><div class="toc"><p><b>Table of 
Contents</b></p><dl><dt><span class="chapter"><a href="#opennlp">1. 
Introduction</a></span></dt><dd><dl><dt><span class="section"><a 
href="#intro.description">Description</a></span></dt><dt><span 
class="section"><a href="#intro.general.library.structure">General Library 
Structure</a></span></dt><dt><span class="section"><a 
href="#intro.api">Application Program Interface (API). Generic 
Example</a></span></dt><dt><span class="section"><a href="#intro.cli">Command 
line interface (CLI)</a></span></dt><dd><dl><dt><span class="section"><a 
href="#intro.cli.description">Description</a></span></dt><dt><span 
class="section"><a href="#intro.cli.toolslist">List of 
tools</a></span></dt><dt><span class="section"><a 
href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a 
href="#intro.cli.generic">Generic 
Example</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.sentdetect">2. Sentence Detector</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.sentdetect.detection">Sentence 
Detection</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.sentdetect.detection.cmdline">Sentence Detection 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.sentdetect.detection.api">Sentence Detection 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.sentdetect.training">Sentence Detector 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.sentdetect.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.sentdetect.eval">Evaluation</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.sentdetect.eval.tool">Evaluation 
Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.tokenizer">3. Tokenizer</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.tokenizer.introduction">Tokenization</a></span></dt><dd><dl><dt><span
 class="section"><a href="#tools.tokenizer.cmdline">Tokenizer 
Tools</a></span></dt><dt><span class="section"><a 
href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></dd><dt><span 
class="section"><a href="#tools.tokenizer.training">Tokenizer 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.tokenizer.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.tokenizer.detokenizing">Detokenizing</a></span></dt><dd><dl><dt><span
 class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing 
API</a></span></dt><dt><span class="section"><a 
href="#tools.tokenizer.detokenizing.dict">Detokenizer 
Dictionary</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.namefind">4. Name Finder</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition">Named Entity 
Recognition</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.namefind.recognition.cmdline">Name Finder 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.namefind.recognition.api">Name Finder 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.namefind.training">Name Finder 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.namefind.training.api">Training 
API</a></span></dt><dt><span class="section"><a 
href="#tools.namefind.training.featuregen">Custom Feature 
Generation</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.namefind.eval">Evaluation</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.namefind.eval.tool">Evaluation 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.namefind.eval.api">Evaluation API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.namefind.annotation_guides">Named Entity Annotation 
Guidelines</a></span></dt></dl></dd><dt><span class="chapter"><a 
href="#tools.doccat">5. Document Categorizer</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.doccat.classifying">Classifying</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.doccat.classifying.cmdline">Document 
Categorizer Tool</a></span></dt><dt><span class="section"><a 
href="#tools.doccat.classifying.api">Document Categorizer 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.doccat.training">Training</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.doccat.training.tool">Training 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.doccat.training.api">Training 
API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.postagger">6. Part-of-Speech 
Tagger</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.tagging">Tagging</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.postagger.tagging.cmdline">POS Tagger 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.postagger.tagging.api">POS Tagger 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.postagger.training">Training</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.postagger.training.tool">Training 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.postagger.training.api">Training API</a></span></dt><dt><span 
class="section"><a href="#tools.postagger.training.tagdict">Tag 
Dictionary</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.postagger.eval">Evaluation</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.postagger.eval.tool">Evaluation 
Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.lemmatizer">7. Lemmatizer</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.lemmatizer.tagging.cmdline">Lemmatizer Tool</a></span></dt><dt><span 
class="section"><a href="#tools.lemmatizer.tagging.api">Lemmatizer 
API</a></span></dt><dt><span class="section"><a 
href="#tools.lemmatizer.training">Lemmatizer 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.lemmatizer.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.lemmatizer.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.lemmatizer.evaluation">Lemmatizer 
Evaluation</a></span></dt></dl></dd><dt><span class="chapter"><a 
href="#tools.chunker">8. Chunker</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.parser.chunking">Chunking</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.parser.chunking.cmdline">Chunker 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.parser.chunking.api">Chunking 
API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.training">Chunker Training</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.chunker.training.tool">Training 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.chunker.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.chunker.evaluation">Chunker 
Evaluation</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.chunker.evaluation.tool">Chunker Evaluation 
Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.parser">9. Parser</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.parser.parsing">Parsing</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.parser.parsing.cmdline">Parser 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.parser.parsing.api">Parsing API</a></span></dt></dl></dd><dt><span 
class="section"><a href="#tools.parser.training">Parser 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.parser.training.tool">Training Tool</a></span></dt><dt><span class="section"><a 
href="#tools.parser.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.parser.evaluation">Parser 
Evaluation</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.parser.evaluation.tool">Parser Evaluation 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.parser.evaluation.api">Evaluation 
API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#tools.coref">10. Coreference Resolution</a></span></dt><dt><span 
class="chapter"><a href="#tools.extension">11. Extending 
OpenNLP</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.extension.writing">Writing an extension</a></span></dt><dt><span 
class="section"><a href="#tools.extension.osgi">Running in an OSGi 
container</a></span></dt></dl></dd><dt><span class="chapter"><a 
href="#tools.corpora">12. Corpora</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.corpora.conll">CONLL</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.corpora.conll.2000">CONLL 2000</a></span></dt><dt><span 
class="section"><a href="#tools.corpora.conll.2002">CONLL 
2002</a></span></dt><dt><span class="section"><a 
href="#tools.corpora.conll.2003">CONLL 2003</a></span></dt></dl></dd><dt><span 
class="section"><a href="#tools.corpora.arvores-deitadas">Arvores 
Deitadas</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.corpora.arvores-deitadas.getting">Getting the 
data</a></span></dt><dt><span class="section"><a 
href="#tools.corpora.arvores-deitadas.converting">Converting the data 
(optional)</a></span></dt><dt><span class="section"><a 
href="#tools.corpora.arvores-deitadas.evaluation">Training and 
Evaluation</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.corpora.leipzig">Leipzig Corpora</a></span></dt><dt><span 
class="section"><a href="#tools.corpora.ontonotes">OntoNotes Release 
4.0</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.ontonotes.namefinder">Name Finder 
Training</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.corpora.brat">Brat Format Support</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.corpora.brat.webtool">Sentences and 
Tokens</a></span></dt><dt><span class="section"><a 
href="#tools.corpora.brat.training">Training</a></span></dt><dt><span 
class="section"><a 
href="#tools.corpora.brat.evaluation">Evaluation</a></span></dt><dt><span 
class="section"><a href="#tools.corpora.brat.cross-validation">Cross 
Validation</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a 
href="#opennlp.ml">13. Machine Learning</a></span></dt><dd><dl><dt><span 
class="section"><a href="#opennlp.ml.maxent">Maximum 
Entropy</a></span></dt><dd><dl><dt><span class="section"><a 
href="#opennlp.ml.maxent.impl">Implementation</a></span></dt></dl></dd></dl></dd><dt><span
 class="chapter"><a href="#org.apche.opennlp.uima">14. UIMA 
Integration</a></span></dt>
 <dd><dl><dt><span class="section"><a 
href="#org.apche.opennlp.running-pear-sample">Running the pear sample in 
CVD</a></span></dt><dt><span class="section"><a 
href="#org.apche.opennlp.further-help">Further 
Help</a></span></dt></dl></dd><dt><span class="chapter"><a 
href="#tools.morfologik-addon">15. Morfologik 
Addon</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.morfologik-addon.api">Morfologik 
Integration</a></span></dt><dt><span class="section"><a 
href="#tools.morfologik-addon.cmdline">Morfologik CLI 
Tools</a></span></dt></dl></dd><dt><span class="chapter"><a 
href="#tools.cli">16. The Command Line 
Interface</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.cli.doccat">Doccat</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.doccat.Doccat">Doccat</a></span></dt><dt><span 
class="section"><a 
href="#tools.cli.doccat.DoccatTrainer">DoccatTrainer</a></span></dt><dt><span 
class="section"><a href="#tools.cli.doccat.DoccatEvaluator">DoccatEvaluator</a></span></dt><dt><span class="section"><a 
href="#tools.cli.doccat.DoccatCrossValidator">DoccatCrossValidator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.doccat.DoccatConverter">DoccatConverter</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.dictionary">Dictionary</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.dictionary.DictionaryBuilder">DictionaryBuilder</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.tokenizer">Tokenizer</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.tokenizer.SimpleTokenizer">SimpleTokenizer</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.tokenizer.TokenizerME">TokenizerME</a></span></dt><dt><span 
class="section"><a 
href="#tools.cli.tokenizer.TokenizerTrainer">TokenizerTrainer</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.tokenizer.TokenizerMEEvaluator">TokenizerMEEvaluator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.tokenizer.TokenizerCrossValidator">TokenizerCrossValidator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.tokenizer.TokenizerConverter">TokenizerConverter</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.tokenizer.DictionaryDetokenizer">DictionaryDetokenizer</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.sentdetect">Sentdetect</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.sentdetect.SentenceDetector">SentenceDetector</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.sentdetect.SentenceDetectorTrainer">SentenceDetectorTrainer</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.sentdetect.SentenceDetectorEvaluator">SentenceDetectorEvaluator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.sentdetect.SentenceDetectorCrossValidator">SentenceDetectorCrossValidator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.sentdetect.SentenceDetectorConverter">SentenceDetectorConverter</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.cli.namefind">Namefind</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.namefind.TokenNameFinder">TokenNameFinder</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.namefind.TokenNameFinderTrainer">TokenNameFinderTrainer</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.namefind.TokenNameFinderEvaluator">TokenNameFinderEvaluator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.namefind.TokenNameFinderCrossValidator">TokenNameFinderCrossValidator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.namefind.TokenNameFinderConverter">TokenNameFinderConverter</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.namefind.CensusDictionaryCreator">CensusDictionaryCreator</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.postag">Postag</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.cli.postag.POSTagger">POSTagger</a></span></dt><dt><span class="section"><a 
href="#tools.cli.postag.POSTaggerTrainer">POSTaggerTrainer</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.postag.POSTaggerEvaluator">POSTaggerEvaluator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.postag.POSTaggerCrossValidator">POSTaggerCrossValidator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.postag.POSTaggerConverter">POSTaggerConverter</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.lemmatizer">Lemmatizer</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.lemmatizer.LemmatizerME">LemmatizerME</a></span></dt><dt><span 
class="section"><a 
href="#tools.cli.lemmatizer.LemmatizerTrainerME">LemmatizerTrainerME</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.lemmatizer.LemmatizerEvaluator">LemmatizerEvaluator</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.chunker">Chunker</a></span></dt><dd><dl><dt><span 
 class="section"><a 
href="#tools.cli.chunker.ChunkerME">ChunkerME</a></span></dt><dt><span 
class="section"><a 
href="#tools.cli.chunker.ChunkerTrainerME">ChunkerTrainerME</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.chunker.ChunkerEvaluator">ChunkerEvaluator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.chunker.ChunkerCrossValidator">ChunkerCrossValidator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.chunker.ChunkerConverter">ChunkerConverter</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.parser">Parser</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.parser.Parser">Parser</a></span></dt><dt><span 
class="section"><a 
href="#tools.cli.parser.ParserTrainer">ParserTrainer</a></span></dt><dt><span 
class="section"><a 
href="#tools.cli.parser.ParserEvaluator">ParserEvaluator</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.parser.ParserConverter">ParserConverter</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.parser.BuildModelUpdater">BuildModelUpdater</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.parser.CheckModelUpdater">CheckModelUpdater</a></span></dt><dt><span
 class="section"><a 
href="#tools.cli.parser.TaggerModelReplacer">TaggerModelReplacer</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.entitylinker">Entitylinker</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.entitylinker.EntityLinker">EntityLinker</a></span></dt></dl></dd><dt><span
 class="section"><a 
href="#tools.cli.languagemodel">Languagemodel</a></span></dt><dd><dl><dt><span 
class="section"><a 
href="#tools.cli.languagemodel.LanguageModel">LanguageModel</a></span></dt></dl></dd></dl></dd></dl></div><div
 class="list-of-tables"><p><b>List of Tables</b></p><dl><dt>4.1. <a 
href="#d4e278">Generator elements</a></dt></dl></div>
+       
+
+       
+       
+       <div class="chapter" title="Chapter&nbsp;1.&nbsp;Introduction"><div 
class="titlepage"><div><div><h2 class="title"><a 
name="opennlp"></a>Chapter&nbsp;1.&nbsp;Introduction</h2></div></div></div><div 
class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a 
href="#intro.description">Description</a></span></dt><dt><span 
class="section"><a href="#intro.general.library.structure">General Library 
Structure</a></span></dt><dt><span class="section"><a 
href="#intro.api">Application Program Interface (API). Generic 
Example</a></span></dt><dt><span class="section"><a href="#intro.cli">Command 
line interface (CLI)</a></span></dt><dd><dl><dt><span class="section"><a 
href="#intro.cli.description">Description</a></span></dt><dt><span 
class="section"><a href="#intro.cli.toolslist">List of 
tools</a></span></dt><dt><span class="section"><a 
href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a 
href="#intro.cli.generic">Generic Example</a></span></dt></dl></dd></dl></div>
+
+    <div class="section" title="Description"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="intro.description"></a>Description</h2></div></div></div>
+        
+        <p>
+        The Apache OpenNLP library is a machine learning based toolkit for the 
processing of natural language text.
+        It supports the most common NLP tasks, such as tokenization, sentence 
segmentation,
+        part-of-speech tagging, named entity extraction, chunking, parsing, 
and coreference resolution.
+        These tasks are usually required to build more advanced text 
processing services.
+        OpenNLP also includes maximum entropy and perceptron based machine 
learning.
+        </p>
+
+        <p>
+        The goal of the OpenNLP project is to create a mature toolkit for 
the tasks listed above.
+        An additional goal is to provide a large number of pre-built models 
for a variety of languages, as
+        well as the annotated text resources that those models are derived 
from.
+        </p>
+    </div>
+
+    <div class="section" title="General Library Structure"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="intro.general.library.structure"></a>General Library 
Structure</h2></div></div></div>
+        
+        <p>The Apache OpenNLP library contains several components, enabling 
one to build
+            a full natural language processing pipeline. These components
+            include a sentence detector, tokenizer, name finder, document categorizer,
+            part-of-speech tagger, chunker, parser, and coreference resolution.
+            Each component contains parts that execute the respective natural language
+            processing task, train a model, and often also evaluate a model.
+            Each of these facilities is accessible via its application program
+            interface (API). In addition, a command line interface (CLI) is provided
+            to make experiments and training more convenient.
+        </p>
+    </div>
+
+    <div class="section" title="Application Program Interface (API). Generic 
Example"><div class="titlepage"><div><div><h2 class="title" style="clear: 
both"><a name="intro.api"></a>Application Program Interface (API). Generic 
Example</h2></div></div></div>
+        
+        <p>
+            OpenNLP components have similar APIs. Normally, to execute a task,
+            one should provide a model and an input.
+        </p>
+        <p>
+            A model is usually loaded by providing a FileInputStream with a 
model to a
+            constructor of the model class:
+            </p><pre class="programlisting">
+                    
+SomeModel model = null;
+
+<i class="hl-comment" style="color: silver">// try-with-resources closes the stream even if loading fails</i>
+<b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b 
class="hl-string"><i style="color:red">"lang-model-name.bin"</i></b>)) {
+  model = <b class="hl-keyword">new</b> SomeModel(modelIn);
+}
+<b class="hl-keyword">catch</b> (IOException e) {
+  <i class="hl-comment" style="color: silver">//handle the exception</i>
+}
+            </pre><p>
+        </p>
+        <p>
+        After the model is loaded, the tool itself can be instantiated.
+        </p><pre class="programlisting">
+                
+ToolName toolName = <b class="hl-keyword">new</b> ToolName(model);
+        </pre><p>
+        After the tool is instantiated, the processing task can be executed. 
The input and the
+        output formats are specific to the tool, but often the output is an 
array of String,
+        and the input is a String or an array of String.
+        </p><pre class="programlisting">
+                
+String[] output = toolName.executeTask(<b class="hl-string"><i 
style="color:red">"This is a sample text."</i></b>);
+        </pre><p>
+        </p>
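+        <p>
+            The load, instantiate, execute pattern above can be tried end to end with the
+            following minimal sketch. SomeModel and ToolName are the placeholder names used
+            in this section, implemented here as hypothetical stand-ins rather than real
+            OpenNLP classes; the "model" is assumed to be just a delimiter string, and a
+            ByteArrayInputStream replaces the FileInputStream so that no model file is needed.
+        </p>
+        <pre class="programlisting">
```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class GenericApiSketch {

    // Hypothetical stand-in for a model class: it reads the whole stream
    // and keeps its content as a delimiter (assumption for illustration only).
    static class SomeModel {
        final String delimiter;

        SomeModel(InputStream in) throws IOException {
            delimiter = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Hypothetical stand-in for a tool: it is constructed from a model
    // and returns an array of String for a String input.
    static class ToolName {
        private final SomeModel model;

        ToolName(SomeModel model) {
            this.model = model;
        }

        String[] executeTask(String input) {
            return input.split(model.delimiter);
        }
    }

    // Step 1: load the model, step 2: instantiate the tool, step 3: execute the task.
    static String[] run(byte[] modelBytes, String text) throws IOException {
        SomeModel model;
        try (InputStream modelIn = new ByteArrayInputStream(modelBytes)) {
            model = new SomeModel(modelIn);
        }
        ToolName toolName = new ToolName(model);
        return toolName.executeTask(text);
    }

    public static void main(String[] args) throws IOException {
        byte[] modelBytes = " ".getBytes(StandardCharsets.UTF_8);
        String[] output = run(modelBytes, "This is a sample text.");
        System.out.println(String.join("|", output));
    }
}
```
+        </pre>
+        <p>
+            With a real component, only the model class, the tool class and the task method
+            change; the three steps stay the same.
+        </p>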
+    </div>
+
+    <div class="section" title="Command line interface (CLI)"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="intro.cli"></a>Command line interface (CLI)</h2></div></div></div>
+        
+        <div class="section" title="Description"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="intro.cli.description"></a>Description</h3></div></div></div>
+            
+            <p>
+                OpenNLP provides a command line script that serves as a single 
entry point to all
+                included tools. The script is located in the bin directory of 
the OpenNLP binary
+                distribution. Two versions are included: opennlp.bat for Windows 
and opennlp for Linux or
+                compatible systems.
+            </p>
+        </div>
+        
+        <div class="section" title="List of tools"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="intro.cli.toolslist"></a>List of tools</h3></div></div></div>
+            
+            <p>
+                       The list of command line tools for Apache OpenNLP 1.7.2,
+                       as well as a description of their arguments, is available 
in <a class="xref" href="#tools.cli" title="Chapter&nbsp;16.&nbsp;The 
Command Line Interface">Chapter&nbsp;16, <i>The Command Line Interface</i></a>.
+            </p>
+        </div>
+
+        <div class="section" title="Setting up"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="intro.cli.setup"></a>Setting up</h3></div></div></div>
+            
+            <p>
+                The OpenNLP script uses the JAVA_CMD and JAVA_HOME variables to 
determine which command to
+                use to launch the Java virtual machine.
+            </p>
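+            <p>
+                As an illustration only, the lookup could work as in the sketch below. The
+                precedence shown (JAVA_CMD first, then the java binary under JAVA_HOME, then
+                plain java from PATH) is an assumption and is not stated in this manual; the
+                class and method names are hypothetical. Check the script in the bin directory
+                for the actual behaviour.
+            </p>
+            <pre class="programlisting">
```java
public class JavaCommandSketch {

    // True when an environment variable is present and not empty.
    static boolean isSet(String value) {
        if (value == null) {
            return false;
        }
        return !value.isEmpty();
    }

    // Assumed precedence (not confirmed by the manual):
    // 1. JAVA_CMD as given, 2. JAVA_HOME/bin/java, 3. "java" resolved via PATH.
    static String resolveJavaCommand(String javaCmd, String javaHome) {
        if (isSet(javaCmd)) {
            return javaCmd;
        }
        if (isSet(javaHome)) {
            // Unix-style separator, used here for simplicity.
            return javaHome + "/bin/java";
        }
        return "java";
    }

    public static void main(String[] args) {
        String command = resolveJavaCommand(System.getenv("JAVA_CMD"), System.getenv("JAVA_HOME"));
        System.out.println("Java command: " + command);
    }
}
```
+            </pre>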
+            <p>
+                The OpenNLP script uses the OPENNLP_HOME variable to determine the 
location of the binary
+                distribution of OpenNLP. It is recommended to point this 
variable to the binary
+                distribution of the current OpenNLP version and to update the PATH 
variable to include
+                $OPENNLP_HOME/bin or %OPENNLP_HOME%\bin.
+            </p>
+            <p>
+                This configuration makes it convenient to call OpenNLP. 
The examples below
+                assume that it is in place.
+            </p>
+        </div>
+
+        <div class="section" title="Generic Example"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="intro.cli.generic"></a>Generic Example</h3></div></div></div>
+            
+
+            <p>
+                Apache OpenNLP provides a common command line script to access 
all its tools:
+                </p><pre class="screen">
+                
+$ opennlp
+                 </pre><p>
+                This script prints the current version of the library and lists 
all available tools:
+                </p><pre class="screen">
+                
+OpenNLP &lt;VERSION&gt;. Usage: opennlp TOOL
+where TOOL is one of:
+  Doccat                            learnable document categorizer
+  DoccatTrainer                     trainer for the learnable document 
categorizer
+  DoccatConverter                   converts leipzig data format to native 
OpenNLP format
+  DictionaryBuilder                 builds a new dictionary
+  SimpleTokenizer                   character class tokenizer
+  TokenizerME                       learnable tokenizer
+  TokenizerTrainer                  trainer for the learnable tokenizer
+  TokenizerMEEvaluator              evaluator for the learnable tokenizer
+  TokenizerCrossValidator           K-fold cross validator for the learnable 
tokenizer
+  TokenizerConverter                converts foreign data formats 
(namefinder,conllx,pos) to native OpenNLP format
+  DictionaryDetokenizer
+  SentenceDetector                  learnable sentence detector
+  SentenceDetectorTrainer           trainer for the learnable sentence detector
+  SentenceDetectorEvaluator         evaluator for the learnable sentence 
detector
+  SentenceDetectorCrossValidator    K-fold cross validator for the learnable 
sentence detector
+  SentenceDetectorConverter         converts foreign data formats 
(namefinder,conllx,pos) to native OpenNLP format
+  TokenNameFinder                   learnable name finder
+  TokenNameFinderTrainer            trainer for the learnable name finder
+  TokenNameFinderEvaluator          Measures the performance of the NameFinder 
model with the reference data
+  TokenNameFinderCrossValidator     K-fold cross validator for the learnable 
Name Finder
+  TokenNameFinderConverter          converts foreign data formats 
(bionlp2004,conll03,conll02,ad) to native OpenNLP format
+  CensusDictionaryCreator           Converts 1990 US Census names into a 
dictionary
+  POSTagger                         learnable part of speech tagger
+  POSTaggerTrainer                  trains a model for the part-of-speech 
tagger
+  POSTaggerEvaluator                Measures the performance of the POS tagger 
model with the reference data
+  POSTaggerCrossValidator           K-fold cross validator for the learnable 
POS tagger
+  POSTaggerConverter                converts conllx data format to native 
OpenNLP format
+  ChunkerME                         learnable chunker
+  ChunkerTrainerME                  trainer for the learnable chunker
+  ChunkerEvaluator                  Measures the performance of the Chunker 
model with the reference data
+  ChunkerCrossValidator             K-fold cross validator for the chunker
+  ChunkerConverter                  converts ad data format to native OpenNLP 
format
+  Parser                            performs full syntactic parsing
+  ParserTrainer                     trains the learnable parser
+  ParserEvaluator                                      Measures the 
performance of the Parser model with the reference data
+  BuildModelUpdater                 trains and updates the build model in a 
parser model
+  CheckModelUpdater                 trains and updates the check model in a 
parser model
+  TaggerModelReplacer               replaces the tagger model in a parser model
+All tools print help when invoked with help parameter
+Example: opennlp SimpleTokenizer help
+
+                </pre><p>
+            </p>
+            <p>OpenNLP tools share a similar command line structure and options. To discover a tool's
+                options, run it with no parameters:
+                </p><pre class="screen">
+                
+$ opennlp ToolName
+                 </pre><p>
+                The tool will output two blocks of help.
+            </p>
+            <p>
+                The first block describes the general structure of this tool 
command line:
+                </p><pre class="screen">
+                
+Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ...  
-model modelFile ...
+                </pre><p>
+                The general structure of this tool command line includes the 
obligatory tool name
+                (TokenizerTrainer), the optional format parameters 
([.namefinder|.conllx|.pos]),
+                the optional parameters ([-abbDict path] ...), and the 
obligatory parameters
+                (-model modelFile ...).
+            </p>
+            <p>
+                The format parameters enable direct processing of non-native data without conversion.
+                Each format might have its own parameters, which are displayed if the tool is
+                executed without parameters or with the help parameter:
+                </p><pre class="screen">
+                
+$ opennlp TokenizerTrainer.conllx help
+                </pre><p>
+                </p><pre class="screen">
+                
+Usage: opennlp TokenizerTrainer.conllx [-abbDict path] [-alphaNumOpt 
isAlphaNumOpt] ...
+
+Arguments description:
+        -abbDict path
+                abbreviation dictionary in XML format.
+        ...
+                </pre><p>
+                To switch the tool to a specific format, add a dot and the 
format name after
+                the tool name:
+                </p><pre class="screen">
+                
+$ opennlp TokenizerTrainer.conllx -model en-pos.bin ...
+                </pre><p>
+            </p>
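+            <p>The naming convention described above (a tool name, optionally followed by a dot and a
+                format name) can be sketched in plain Java. This is only an illustration of the convention,
+                not the code the opennlp launcher uses; the class and method names are hypothetical.</p>

```java
public class ToolNameSketch {

    /**
     * Splits a command line tool argument such as "TokenizerTrainer.conllx"
     * into the tool name and the optional format name.
     * The second element is null when no format suffix is present.
     */
    static String[] split(String toolArgument) {
        int dot = toolArgument.indexOf('.');
        if (dot < 0) {
            return new String[] { toolArgument, null };
        }
        return new String[] {
            toolArgument.substring(0, dot),
            toolArgument.substring(dot + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("TokenizerTrainer.conllx");
        System.out.println("tool=" + parts[0] + " format=" + parts[1]);
    }
}
```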
+            <p>
+                The second block of the help message describes the individual 
arguments:
+                </p><pre class="screen">
+                
+Arguments description:
+        -type maxent|perceptron|perceptron_sequence
+                The type of the token name finder model. One of 
maxent|perceptron|perceptron_sequence.
+        -dict dictionaryPath
+                The XML tag dictionary file
+        ...
+                </pre><p>
+            </p>
+            <p>
+                Most processing tools need to be given at least a model:
+                </p><pre class="screen">
+                
+$ opennlp ToolName lang-model-name.bin
+                 </pre><p>
+                When a tool is executed this way, the model is loaded and the tool waits for
+                input on standard input. The input is processed and the result is printed to standard
+                output.
+            </p>
+            <p>Alternatively, and most commonly, console input and output redirection is used
+                to supply an input file and an output file:
+                </p><pre class="screen">
+            
+$ opennlp ToolName lang-model-name.bin &lt; input.txt &gt; output.txt
+                </pre><p>
+            </p>
+            <p>
+                Most model training tools need to be given a model name first,
+                then optionally some training options (such as model type or number of iterations),
+                and then the data.
+            </p>
+            <p>
+                A model name is just a file name.
+            </p>
+            <p>
+                Training options often include the number of iterations, the cutoff, or an
+                abbreviation dictionary. Sometimes these options can also be provided via a
+                training options file. In that case the corresponding command line options are
+                ignored and the ones from the file are used.
+            </p>
+            <p>
+                For the data, one has to specify its location (a file name) and often the
+                language and encoding.
+            </p>
+            <p>
+                A generic example of a command line to launch a tool trainer 
might be:
+                </p><pre class="screen">
+                
+$ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train 
-encoding UTF-8
+                 </pre><p>
+                or with a format:
+                </p><pre class="screen">
+                
+$ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data 
input.train \
+                                  -types per -encoding UTF-8
+                 </pre><p>
+            </p>
+            <p>Most tools for model evaluation are similar to those for task execution, and
+                need to be given a model name first, then optionally some evaluation options (such
+                as whether to print misclassified samples), and then the test data. A generic
+                example of a command line to launch an evaluation tool might be:
+                </p><pre class="screen">
+                
+$ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test 
-encoding UTF-8
+                 </pre><p>
+            </p>
+        </div>
+    </div>
+
+</div>
+       <div class="chapter" title="Chapter&nbsp;2.&nbsp;Sentence 
Detector"><div class="titlepage"><div><div><h2 class="title"><a 
name="tools.sentdetect"></a>Chapter&nbsp;2.&nbsp;Sentence 
Detector</h2></div></div></div><div class="toc"><p><b>Table of 
Contents</b></p><dl><dt><span class="section"><a 
href="#tools.sentdetect.detection">Sentence 
Detection</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.sentdetect.detection.cmdline">Sentence Detection 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.sentdetect.detection.api">Sentence Detection 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.sentdetect.training">Sentence Detector 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.sentdetect.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.sentdetect.eval">Evaluation</a></span></dt>
 <dd><dl><dt><span class="section"><a 
href="#tools.sentdetect.eval.tool">Evaluation 
Tool</a></span></dt></dl></dd></dl></div>
+
+       
+
+       <div class="section" title="Sentence Detection"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.sentdetect.detection"></a>Sentence Detection</h2></div></div></div>
+               
+               <p>
+               The OpenNLP Sentence Detector can detect whether a punctuation character
+               marks the end of a sentence or not. In this sense a sentence is defined
+               as the longest whitespace-trimmed character sequence between two punctuation
+               marks. The first and last sentence are exceptions to this rule. The first
+               non-whitespace character is assumed to be the beginning of a sentence, and the
+               last non-whitespace character is assumed to be the end of a sentence.
+               The sample text below should be segmented into its sentences.
+               </p><pre class="screen">
+                               
+Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29. Mr. Vinken is
+chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years
+old and former chairman of Consolidated Gold Fields PLC, was named a director 
of this
+British industrial conglomerate.
+               </pre><p>
+               After detecting the sentence boundaries each sentence is written on its own line.
+               </p><pre class="screen">
+                               
+Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
+Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
+Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields 
PLC,
+    was named a director of this British industrial conglomerate.
+               </pre><p>
+               Usually sentence detection is done before the text is tokenized, and that is how the pre-trained models on the web site are trained,
+               but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
+               The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article, where the title is mistakenly identified as the first part of the first sentence.
+               Most components in OpenNLP expect input which is segmented into sentences.
+               </p>
+               
+               <div class="section" title="Sentence Detection Tool"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.sentdetect.detection.cmdline"></a>Sentence Detection 
Tool</h3></div></div></div>
+               
+               <p>
+               The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
+               Download the English sentence detector model and start the Sentence Detector Tool with this command:
+        </p><pre class="screen">
+        
+$ opennlp SentenceDetector en-sent.bin
+               </pre><p>
+               Just copy the sample text from above to the console. The 
Sentence Detector will read it and echo one sentence per line to the console.
+               Usually the input is read from a file and the output is 
redirected to another file. This can be achieved with the following command.
+               </p><pre class="screen">
+                               
+$ opennlp SentenceDetector en-sent.bin &lt; input.txt &gt; output.txt
+               </pre><p>
+               For the English sentence model from the website the input text should not be tokenized.
+               </p>
+               </div>
+               <div class="section" title="Sentence Detection API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.sentdetect.detection.api"></a>Sentence Detection 
API</h3></div></div></div>
+               
+               <p>
+               The Sentence Detector can be easily integrated into an 
application via its API.
+               To instantiate the Sentence Detector the sentence model must be 
loaded first.
+               </p><pre class="programlisting">
+                               
+// Declared outside the try block so that the model is still in scope afterwards.
+SentenceModel model = null;
+
+<b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-sent.bin"</i></b>)) {
+  model = <b class="hl-keyword">new</b> SentenceModel(modelIn);
+}
+<b class="hl-keyword">catch</b> (IOException e) {
+  <i>// The stream is closed automatically; model stays null if loading failed.</i>
+  e.printStackTrace();
+}
+               </pre><p>
+               After the model is loaded the SentenceDetectorME can be 
instantiated.
+               </p><pre class="programlisting">
+                               
+SentenceDetectorME sentenceDetector = <b class="hl-keyword">new</b> 
SentenceDetectorME(model);
+               </pre><p>
+               The Sentence Detector can output an array of Strings, where 
each String is one sentence.
+                               </p><pre class="programlisting">
+                               
+String sentences[] = sentenceDetector.sentDetect(<b class="hl-string"><i 
style="color:red">"  First sentence. Second sentence. "</i></b>);
+               </pre><p>
+               The result array now contains two entries. The first String is "First sentence." and the
+        second String is "Second sentence." The whitespace before, between and after the sentences is removed.
+               The API also offers a method which returns the spans of the sentences in the input string.
+               </p><pre class="programlisting">
+                               
+Span sentences[] = sentenceDetector.sentPosDetect(<b class="hl-string"><i 
style="color:red">"  First sentence. Second sentence. "</i></b>);
+               </pre><p>
+               The result array again contains two entries. The first span begins at index 2 and ends at
+            17. The second span begins at 18 and ends at 34. The end index is exclusive. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span.
+               </p>
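+               <p>The offsets quoted above can be checked without OpenNLP on the classpath. The sketch below uses a
+               small hypothetical Offsets class as a stand-in for Span (it is not the OpenNLP class) and assumes, as
+               the numbers above imply, that the begin index is inclusive and the end index is exclusive.</p>

```java
public class SpanSketch {

    /** Hypothetical stand-in for a span: begin is inclusive, end is exclusive. */
    static final class Offsets {
        final int begin;
        final int end;

        Offsets(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns the characters of the text that are covered by the given offsets. */
    static String coveredText(Offsets span, String text) {
        return text.substring(span.begin, span.end);
    }

    public static void main(String[] args) {
        String input = "  First sentence. Second sentence. ";
        Offsets[] spans = { new Offsets(2, 17), new Offsets(18, 34) };
        for (Offsets span : spans) {
            // Brackets make it visible that no surrounding whitespace is included.
            System.out.println("[" + coveredText(span, input) + "]");
        }
    }
}
```

+               <p>Working with offsets instead of copied Strings keeps the link back to the original text, which is
+               useful when results have to be mapped onto the unmodified input.</p>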
+               </div>
+       </div>
+       <div class="section" title="Sentence Detector Training"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.sentdetect.training"></a>Sentence Detector 
Training</h2></div></div></div>
+               
+               <p></p>
+               <div class="section" title="Training Tool"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.sentdetect.training.tool"></a>Training Tool</h3></div></div></div>
+               
+               <p>
+               OpenNLP has a command line tool which is used to train the models available from the model
+               download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
+               training format, which is one sentence per line. An empty line indicates a document boundary.
+               If the document boundaries are unknown, it is recommended to insert an empty line every few tens of
+               sentences. The format is exactly like the output in the sample above.
+               Usage of the tool:
+               </p><pre class="screen">
+                               
+$ opennlp SentenceDetectorTrainer
+Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict 
path] \
+               [-params paramsFile] [-iterations num] [-cutoff num] -model 
modelFile \
+               -lang language -data sampleData [-encoding charsetName]
+
+Arguments description:
+        -abbDict path
+                abbreviation dictionary in XML format.
+        -params paramsFile
+                training parameters file.
+        -iterations num
+                number of training iterations, ignored if -params is used.
+        -cutoff num
+                minimal number of times a feature must be seen, ignored if 
-params is used.
+        -model modelFile
+                output model file.
+        -lang language
+                language which is being processed.
+        -data sampleData
+                data to be used, usually a file name.
+        -encoding charsetName
+                encoding for reading and writing text, if absent the system 
default is used.
+       </pre><p>
+               To train an English sentence detector use the following command:
+        </p><pre class="screen">
+                               
+$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data 
en-sent.train -encoding UTF-8
+                        
+        </pre><p>
+            It should produce the following output:
+            </p><pre class="screen">
+                
+Indexing events using cutoff of 5
+
+       Computing event counts...  done. 4883 events
+       Indexing...  done.
+Sorting and merging events... done. Reduced 4883 events to 2945.
+Done indexing.
+Incorporating indexed data for training...  
+done.
+       Number of Event Tokens: 2945
+           Number of Outcomes: 2
+         Number of Predicates: 467
+...done.
+Computing model parameters...
+Performing 100 iterations.
+  1:  .. loglikelihood=-3384.6376826743144     0.38951464263772273
+  2:  .. loglikelihood=-2191.9266688597672     0.9397911120212984
+  3:  .. loglikelihood=-1645.8640771555981     0.9643661683391358
+  4:  .. loglikelihood=-1340.386303774519      0.9739913987302887
+  5:  .. loglikelihood=-1148.4141548519624     0.9748105672742167
+
+ ...&lt;skipping a bunch of iterations&gt;...
+
+ 95:  .. loglikelihood=-288.25556805874436     0.9834118369854598
+ 96:  .. loglikelihood=-287.2283680343481      0.9834118369854598
+ 97:  .. loglikelihood=-286.2174830344526      0.9834118369854598
+ 98:  .. loglikelihood=-285.222486981048       0.9834118369854598
+ 99:  .. loglikelihood=-284.24296917223916     0.9834118369854598
+100:  .. loglikelihood=-283.2785335773966      0.9834118369854598
+Wrote sentence detector model.
+Path: en-sent.bin
+
+               </pre><p>
+               </p>
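+               <p>The training format described in this section (one sentence per line, an empty line between
+               documents) can be produced with a few lines of plain Java. The following is a minimal sketch, assuming
+               the sentences are already available as lists of Strings; the class and method names are hypothetical
+               and not part of OpenNLP.</p>

```java
import java.util.Arrays;
import java.util.List;

public class SentenceTrainingDataSketch {

    /**
     * Renders documents in the one-sentence-per-line format.
     * An empty line is written between two documents to mark the boundary.
     */
    static String format(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < documents.size(); i++) {
            if (i > 0) {
                out.append('\n'); // empty line marks a document boundary
            }
            for (String sentence : documents.get(i)) {
                out.append(sentence.trim()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
            Arrays.asList(
                "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.",
                "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."),
            Arrays.asList(
                "Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, "
                    + "was named a director of this British industrial conglomerate."));
        System.out.print(format(documents));
    }
}
```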
+               </div>
+               <div class="section" title="Training API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.sentdetect.training.api"></a>Training API</h3></div></div></div>
+               
+               <p>
+               The Sentence Detector also offers an API to train a new 
sentence detection model.
+               Basically three steps are necessary to train it:
+               </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                                       <p>The application must open a sample 
data stream</p>
+                               </li><li class="listitem">
+                                       <p>Call the SentenceDetectorME.train 
method</p>
+                               </li><li class="listitem">
+                                       <p>Save the SentenceModel to a file or 
directly use it</p>
+                               </li></ul></div><p>
+                       The following sample code illustrates these steps:
+                                       </p><pre class="programlisting">
+                               
+Charset charset = Charset.forName(<b class="hl-string"><i 
style="color:red">"UTF-8"</i></b>);                          
+ObjectStream&lt;String&gt; lineStream =
+  <b class="hl-keyword">new</b> PlainTextByLineStream(<b 
class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i 
style="color:red">"en-sent.train"</i></b>), charset);
+ObjectStream&lt;SentenceSample&gt; sampleStream = <b 
class="hl-keyword">new</b> SentenceSampleStream(lineStream);
+
+SentenceModel model;
+
+<b class="hl-keyword">try</b> {
+  model = SentenceDetectorME.train(<b class="hl-string"><i 
style="color:red">"en"</i></b>, sampleStream, true, null, 
TrainingParameters.defaultParams());
+}
+<b class="hl-keyword">finally</b> {
+  sampleStream.close();
+}
+
+OutputStream modelOut = null;
+<b class="hl-keyword">try</b> {
+  modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b 
class="hl-keyword">new</b> FileOutputStream(modelFile));
+  model.serialize(modelOut);
+} <b class="hl-keyword">finally</b> {
+  <b class="hl-keyword">if</b> (modelOut != null) 
+     modelOut.close();      
+}
+               </pre><p>
+               </p>
+               </div>
+       </div>
+       <div class="section" title="Evaluation"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.sentdetect.eval"></a>Evaluation</h2></div></div></div>
+               
+               <p>
+               </p>
+               <div class="section" title="Evaluation Tool"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.sentdetect.eval.tool"></a>Evaluation Tool</h3></div></div></div>
+                       
+                       <p>
+                The following command shows how the evaluator tool can be run:
+                </p><pre class="screen">
+                               
+$ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval 
-encoding UTF-8
+
+Loading model ... done
+Evaluating ... done
+
+Precision: 0.9465737514518002
+Recall: 0.9095982142857143
+F-Measure: 0.9277177006260672
+                </pre><p>
+                The en-sent.eval file has the same format as the training data.
+                       </p>
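+                       <p>The three figures in the sample output are related. Assuming the reported F-Measure is the
+                       balanced F1 score, that is the harmonic mean of precision and recall (the numbers above are
+                       consistent with that assumption), it can be recomputed as sketched below. The class name is
+                       hypothetical and the code does not use the OpenNLP evaluation classes.</p>

```java
public class FMeasureSketch {

    /** Balanced F-measure: the harmonic mean of precision and recall. */
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid division by zero when nothing was found
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall taken from the sample evaluator output above.
        double precision = 0.9465737514518002;
        double recall = 0.9095982142857143;
        System.out.println("F-Measure: " + fMeasure(precision, recall));
    }
}
```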
+               </div>
+       </div>
+</div>
+       <div class="chapter" title="Chapter&nbsp;3.&nbsp;Tokenizer"><div 
class="titlepage"><div><div><h2 class="title"><a 
name="tools.tokenizer"></a>Chapter&nbsp;3.&nbsp;Tokenizer</h2></div></div></div><div
 class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a 
href="#tools.tokenizer.introduction">Tokenization</a></span></dt><dd><dl><dt><span
 class="section"><a href="#tools.tokenizer.cmdline">Tokenizer 
Tools</a></span></dt><dt><span class="section"><a 
href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></dd><dt><span 
class="section"><a href="#tools.tokenizer.training">Tokenizer 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.tokenizer.training.api">Training 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.tokenizer.detokenizing">Detokenizing</a></span></dt><dd><dl><dt><span
 class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing API</a></span></dt><dt><span class="section"><a 
href="#tools.tokenizer.detokenizing.dict">Detokenizer 
Dictionary</a></span></dt></dl></dd></dl></div>
+
+       
+
+       <div class="section" title="Tokenization"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.tokenizer.introduction"></a>Tokenization</h2></div></div></div>
+               
+               <p>
+                       The OpenNLP Tokenizers segment an input character 
sequence into
+                       tokens. Tokens are usually
+                       words, punctuation, numbers, etc.
+
+                       </p><pre class="screen">
+                       
+Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
+Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
+Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields
+    PLC, was named a director of this British industrial conglomerate.
+                       
+                   </pre><p>
+
+                       The following result shows the individual tokens in a 
whitespace
+                       separated representation.
+
+                       </p><pre class="screen">
+                       
+Pierre Vinken , 61 years old , will join the board as a nonexecutive director 
Nov. 29 .
+Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
+Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC ,
+    was named a director of this British industrial conglomerate 
. 
+A form of asbestos once used to make Kent cigarette filters has caused a high
+    percentage of cancer deaths among a group of workers exposed to it more 
than 30 years ago ,
+    researchers reported . 
+                       
+                       </pre><p>
+
+                       OpenNLP offers multiple tokenizer implementations:
+                       </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                                       <p>Whitespace Tokenizer - A whitespace 
tokenizer, non whitespace
+                                               sequences are identified as 
tokens</p>
+                               </li><li class="listitem">
+                                       <p>Simple Tokenizer - A character class 
tokenizer, sequences of
+                                               the same character class are 
tokens</p>
+                               </li><li class="listitem">
+                                       <p>Learnable Tokenizer - A maximum entropy tokenizer, detects
+                                               token boundaries based on a probability model</p>
+                               </li></ul></div><p>
+
+                       Most part-of-speech taggers, parsers and so on, work 
with text
+                       tokenized in this manner. It is important to ensure 
that your
+                       tokenizer
+                       produces tokens of the type expected by your later text
+                       processing
+                       components.
+               </p>
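+               <p>To make the difference between the first two strategies concrete, here is a simplified sketch in
+               plain Java. It is not the OpenNLP implementation: the class name, the method names and the four coarse
+               character classes (whitespace, letter, digit, other) are assumptions made for illustration, and the
+               real Simple Tokenizer may classify or split some characters differently.</p>

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    /** Whitespace tokenization: every maximal run of non-whitespace characters is a token. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Coarse character classes: 0 = whitespace, 1 = letter, 2 = digit, 3 = everything else. */
    static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    /** Character class tokenization: a token ends wherever the character class changes. */
    static List<String> characterClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start of the current token, -1 while between tokens
        for (int i = 0; i < text.length(); i++) {
            int cls = charClass(text.charAt(i));
            if (start >= 0 && cls != charClass(text.charAt(start))) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
            if (start < 0 && cls != 0) {
                start = i;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "will join the board Nov. 29.";
        System.out.println(whitespaceTokens(text));
        System.out.println(characterClassTokens(text));
    }
}
```

+               <p>In this sketch the whitespace variant keeps "Nov." and "29." intact, while the character class
+               variant separates the periods. Neither can decide that the period in "Nov." belongs to an abbreviation,
+               which is the kind of decision a learnable tokenizer is trained to make.</p>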
+
+               <p>
+                       With OpenNLP (as with many systems), tokenization is a 
two-stage
+                       process:
+                       first, sentence boundaries are identified, then tokens 
within
+                       each
+                       sentence are identified.
+               </p>
+       
+       <div class="section" title="Tokenizer Tools"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.tokenizer.cmdline"></a>Tokenizer Tools</h3></div></div></div>
+               
+               <p>The easiest way to try out the tokenizers is to use the command line
+                       tools. The tools are only intended for demonstration and testing.
+               </p>
+               <p>There are two tools, one for the Simple Tokenizer and one for
+                       the learnable tokenizer. A command line tool for the Whitespace
+                       Tokenizer does not exist, because the whitespace separated output
+                       would be identical to the input.</p>
+               <p>
+                       The following command shows how to use the Simple 
Tokenizer Tool.
+
+                       </p><pre class="screen">
+                       
+$ opennlp SimpleTokenizer
+                   </pre><p>
+                       To use the learnable tokenizer download the English token model from
+                       our website.
+                       </p><pre class="screen">
+                       
+$ opennlp TokenizerME en-token.bin
+                   </pre><p>
+                       To test the tokenizer copy the sample from above to the 
console. The
+                       whitespace separated tokens will be written back to the
+                       console.
+               </p>
+               <p>
+                       Usually the input is read from a file and written to a 
file.
+                       </p><pre class="screen">
+                       
+$ opennlp TokenizerME en-token.bin &lt; article.txt &gt; article-tokenized.txt
+                   </pre><p>
+                       It can be done in the same way for the Simple Tokenizer.
+               </p>
+               <p>
+                       Since most text comes truly raw and doesn't have sentence boundaries
+                       and such, it is possible to create a pipe which first performs sentence
+                       boundary detection and then tokenization. The following sample illustrates
+                       that.
+                       </p><pre class="screen">
+                       
+$ opennlp SentenceDetector sentdetect.model &lt; article.txt | opennlp 
TokenizerME tokenize.model | more
+Loading model ... Loading model ... done
+done
+Showa Shell gained 20 to 1,570 and Mitsubishi Oil rose 50 to 1,500.
+Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 .
+Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 .
+Marubeni advanced 11 to 890 .
+London share prices were bolstered largely by continued gains on Wall Street 
and technical 
+    factors affecting demand for London 's blue-chip stocks .
+...etc...
+                </pre><p>
+                       Of course this is all on the command line. Many people use the models
+                       directly in their Java code by creating SentenceDetector and
+                       Tokenizer objects and calling their methods as appropriate. The
+                       following section explains how the Tokenizers can be used
+                       directly from Java.
+               </p>
+       </div>
+
+       <div class="section" title="Tokenizer API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.tokenizer.api"></a>Tokenizer API</h3></div></div></div>
+               
+               <p>
+                       The Tokenizers can be integrated into an application via their
+                       API.
+                       The shared instance of the WhitespaceTokenizer can be retrieved from the
+                       static field WhitespaceTokenizer.INSTANCE. The shared instance of the
+                       SimpleTokenizer can be retrieved in the same way from
+                       SimpleTokenizer.INSTANCE.
+                       To instantiate the TokenizerME (the learnable tokenizer) a Token Model
+                       must be loaded first. The following code sample shows how a model
+                       can be loaded.
+                       </p><pre class="programlisting">
+                       
+// Declared outside the try block so that the model is still in scope afterwards.
+TokenizerModel model = null;
+
+<b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-token.bin"</i></b>)) {
+  model = <b class="hl-keyword">new</b> TokenizerModel(modelIn);
+}
+<b class="hl-keyword">catch</b> (IOException e) {
+  <i>// The stream is closed automatically; model stays null if loading failed.</i>
+  e.printStackTrace();
+}
+                </pre><p>
+                       After the model is loaded the TokenizerME can be 
instantiated.
+                       </p><pre class="programlisting">
+                       
+Tokenizer tokenizer = <b class="hl-keyword">new</b> TokenizerME(model);
+                </pre><p>
+                       The tokenizer offers two tokenize methods. Both expect an input
+                       String object which contains the untokenized text. If possible it
+                       should be a sentence, but depending on the training of the learnable
+                       tokenizer this is not required. The first method returns an array of
+                       Strings, where each String is one token.
+                       </p><pre class="programlisting">
+                       
+String tokens[] = tokenizer.tokenize(<b class="hl-string"><i 
style="color:red">"An input sample sentence."</i></b>);
+                </pre><p>
+                       The output will be an array with these tokens.
+                       </p><pre class="programlisting">
+                       
+"An", "input", "sample", "sentence", "."
+                </pre><p>
+                       The second method, tokenizePos, returns an array of Spans. Each Span
+                       contains the begin and end character offsets of the token in the input
+                       String.
+                       </p><pre class="programlisting">
+                       
+Span tokenSpans[] = tokenizer.tokenizePos(<b class="hl-string"><i 
style="color:red">"An input sample sentence."</i></b>);              
+                       </pre><p>
+                       The tokenSpans array now contains 5 elements. To get the text for one
+                       span call getCoveredText on the span and pass it the input text.
+
+                       The TokenizerME is able to output the probabilities for the detected
+                       tokens. The getTokenProbabilities method must be called directly
+                       after one of the tokenize methods was called.
+                       </p><pre class="programlisting">
+                       
+TokenizerME tokenizer = ...
+
+String tokens[] = tokenizer.tokenize(...);
+<b class="hl-keyword">double</b> tokenProbs[] = 
tokenizer.getTokenProbabilities();
+                       </pre><p>
+                       The tokenProbs array now contains one double value per token. Each
+                       value is between 0 and 1, where 1 is the highest possible probability
+                       and 0 the lowest possible probability.
+               </p>
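+               <p>
+                       The character offsets returned by tokenizePos can be illustrated without
+                       OpenNLP. The following minimal sketch assumes half-open offsets (begin
+                       inclusive, end exclusive). The offsets are written by hand to match the five
+                       tokens of the sample sentence above, and the class and method names are
+                       hypothetical, not part of the OpenNLP API.
+               </p><pre class="programlisting">
```java
public class TokenSpanSketch {

    // Returns the text covered by the half-open character range [begin, end).
    static String coveredText(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "An input sample sentence.";

        // Hand-written offsets (an assumption for illustration),
        // one {begin, end} pair per token of the sample sentence.
        int[][] spans = { {0, 2}, {3, 8}, {9, 15}, {16, 24}, {24, 25} };

        // Print the covered text of every span, one token per line.
        for (int[] span : spans) {
            System.out.println(coveredText(text, span[0], span[1]));
        }
    }
}
```
+               </pre>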
+       </div>
+       </div>
+       
+       <div class="section" title="Tokenizer Training"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.tokenizer.training"></a>Tokenizer Training</h2></div></div></div>
+               
+                       
+               <div class="section" title="Training Tool"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.tokenizer.training.tool"></a>Training Tool</h3></div></div></div>
+                       
+                       <p>
+                               OpenNLP has a command line tool which is used to train the models
+                               available from the model download page on various corpora. The data
+                               can be converted to the OpenNLP Tokenizer training format or used directly.
+                The OpenNLP format contains one sentence per line. Tokens are separated either by
+                whitespace or by a special &lt;SPLIT&gt; tag.
+                               
+                               The following sample shows the sentences from above in the correct format.
+                               </p><pre class="screen">
+                           
+Pierre Vinken&lt;SPLIT&gt;, 61 years old&lt;SPLIT&gt;, will join the board as 
a nonexecutive director Nov. 29&lt;SPLIT&gt;.
+Mr. Vinken is chairman of Elsevier N.V.&lt;SPLIT&gt;, the Dutch publishing 
group&lt;SPLIT&gt;.
+Rudolph Agnew&lt;SPLIT&gt;, 55 years old and former chairman of Consolidated 
Gold Fields PLC&lt;SPLIT&gt;,
+    was named a nonexecutive director of this British industrial 
conglomerate&lt;SPLIT&gt;.
+                           </pre><p>
+                           Usage of the tool:
+                           </p><pre class="screen">
+                           
+$ opennlp TokenizerTrainer
+Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
+                [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] [-iterations 
num] \
+                [-cutoff num] -model modelFile -lang language -data sampleData 
\
+                [-encoding charsetName]
+
+Arguments description:
+        -abbDict path
+                abbreviation dictionary in XML format.
+        -alphaNumOpt isAlphaNumOpt
+                Optimization flag to skip alpha numeric tokens for further 
tokenization
+        -params paramsFile
+                training parameters file.
+        -iterations num
+                number of training iterations, ignored if -params is used.
+        -cutoff num
+                minimal number of times a feature must be seen, ignored if 
-params is used.
+        -model modelFile
+                output model file.
+        -lang language
+                language which is being processed.
+        -data sampleData
+                data to be used, usually a file name.
+        -encoding charsetName
+                encoding for reading and writing text, if absent the system 
default is used.
+                </pre><p>
+                               To train the English tokenizer use the following command:
+                               </p><pre class="screen">
+                           
+$ opennlp TokenizerTrainer -model en-token.bin -alphaNumOpt -lang en -data 
en-token.train -encoding UTF-8
+
+Indexing events using cutoff of 5
+
+       Computing event counts...  done. 262271 events
+       Indexing...  done.
+Sorting and merging events... done. Reduced 262271 events to 59060.
+Done indexing.
+Incorporating indexed data for training...  
+done.
+       Number of Event Tokens: 59060
+           Number of Outcomes: 2
+         Number of Predicates: 15695
+...done.
+Computing model parameters...
+Performing 100 iterations.
+  1:  .. loglikelihood=-181792.40419263614     0.9614292087192255
+  2:  .. loglikelihood=-34208.094253153664     0.9629238459456059
+  3:  .. loglikelihood=-18784.123872910015     0.9729211388220581
+  4:  .. loglikelihood=-13246.88162585859      0.9856103038460219
+  5:  .. loglikelihood=-10209.262670265718     0.9894422181636552
+
+ ...&lt;skipping a bunch of iterations&gt;...
+
+ 95:  .. loglikelihood=-769.2107474529454      0.999511955191386
+ 96:  .. loglikelihood=-763.8891914534009      0.999511955191386
+ 97:  .. loglikelihood=-758.6685383254891      0.9995157680414533
+ 98:  .. loglikelihood=-753.5458314695236      0.9995157680414533
+ 99:  .. loglikelihood=-748.5182305519613      0.9995157680414533
+100:  .. loglikelihood=-743.5830058068038      0.9995157680414533
+Wrote tokenizer model.
+Path: en-token.bin
+                               </pre><p>
+                       </p>
+               </div>
+               <div class="section" title="Training API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.tokenizer.training.api"></a>Training API</h3></div></div></div>
+                       
+            <p>
+                The Tokenizer offers an API to train a new tokenization model.
+                Three steps are necessary to train it:
+                </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                        <p>The application must open a sample data stream</p>
+                    </li><li class="listitem">
+                        <p>Call the TokenizerME.train method</p>
+                    </li><li class="listitem">
+                        <p>Save the TokenizerModel to a file or directly use 
it</p>
+                    </li></ul></div><p>
+                The following sample code illustrates these steps:
+                </p><pre class="programlisting">
+                    
+Charset charset = Charset.forName(<b class="hl-string"><i 
style="color:red">"UTF-8"</i></b>);
+ObjectStream&lt;String&gt; lineStream = <b class="hl-keyword">new</b> 
PlainTextByLineStream(<b class="hl-keyword">new</b> FileInputStream(<b 
class="hl-string"><i style="color:red">"en-token.train"</i></b>),
+    charset);
+ObjectStream&lt;TokenSample&gt; sampleStream = <b class="hl-keyword">new</b> 
TokenSampleStream(lineStream);
+
+TokenizerModel model;
+
+<b class="hl-keyword">try</b> {
+  model = TokenizerME.train(<b class="hl-string"><i 
style="color:red">"en"</i></b>, sampleStream, true, 
TrainingParameters.defaultParams());
+}
+<b class="hl-keyword">finally</b> {
+  sampleStream.close();
+}
+
+OutputStream modelOut = null;
+<b class="hl-keyword">try</b> {
+  modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b 
class="hl-keyword">new</b> FileOutputStream(modelFile));
+  model.serialize(modelOut);
+} <b class="hl-keyword">finally</b> {
+  <b class="hl-keyword">if</b> (modelOut != null)
+     modelOut.close();
+}
+                </pre><p>
+            </p>
+               </div>
+       </div>
+       
+       <div class="section" title="Detokenizing"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.tokenizer.detokenizing"></a>Detokenizing</h2></div></div></div>
+               
+               <p>
+               Detokenizing is simply the opposite of tokenization: the original non-tokenized string should
+               be reconstructed from a token sequence. The OpenNLP implementation was created to undo the tokenization
+               of training data for the tokenizer. It can also be used to undo the tokenization of such a trained
+               tokenizer. The implementation is strictly rule based and defines how tokens should be attached
+               to a sentence-wise character sequence.
+               </p>
+               <p>
+               The rule dictionary assigns to every token an operation which describes how it should be attached
+               to one continuous character sequence.
+               </p>
+               <p>
+               The following rules can be assigned to a token:
+               </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                               <p>MERGE_TO_LEFT - Merges the token to the left 
side.</p>
+                       </li><li class="listitem">
+                               <p>MERGE_TO_RIGHT - Merges the token to the 
right side.</p>
+                       </li><li class="listitem">
+                               <p>RIGHT_LEFT_MATCHING - Merges the token to 
the right side on first occurrence
+                               and to the left side on second occurrence.</p>
+                       </li></ul></div><p>
+
+               The following sample illustrates how the detokenizer works with a small
+               rule dictionary (illustration format, not the XML data format):
+               </p><pre class="programlisting">
+                       
+. MERGE_TO_LEFT
+" RIGHT_LEFT_MATCHING          
+               </pre><p>
+               The dictionary should be used to de-tokenize the following 
whitespace tokenized sentence:
+               </p><pre class="programlisting">
+                       
+He said " This is a test " .           
+               </pre><p>
+               The tokens would get these tags based on the dictionary:
+               </p><pre class="programlisting">
+                       
+He -&gt; NO_OPERATION
+said -&gt; NO_OPERATION
+" -&gt; MERGE_TO_RIGHT
+This -&gt; NO_OPERATION
+is -&gt; NO_OPERATION
+a -&gt; NO_OPERATION
+test -&gt; NO_OPERATION
+" -&gt; MERGE_TO_LEFT
+. -&gt; MERGE_TO_LEFT          
+                       </pre><p>
+                       That will result in the following character sequence:
+               </p><pre class="programlisting">
+                       
+He said "This is a test".              
+               </pre><p>
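+               The tagging and merging steps above can be sketched in plain Java. This is a minimal
+               illustration under stated assumptions, not the OpenNLP detokenizer: the class, the enum
+               and the method are hypothetical, and the operations are passed in already resolved
+               (each RIGHT_LEFT_MATCHING token has been tagged MERGE_TO_RIGHT or MERGE_TO_LEFT).
+               </p><pre class="programlisting">
```java
public class DetokenizeSketch {

    // Local stand-in for the operations named above (not the OpenNLP type).
    enum Operation { MERGE_TO_LEFT, MERGE_TO_RIGHT, NO_OPERATION }

    // Joins tokens with single spaces, dropping the space wherever the
    // current token merges to the left or the previous one merges to the right.
    static String detokenize(String[] tokens, Operation[] operations) {
        StringBuilder text = new StringBuilder();
        boolean previousMergesRight = false;

        for (int i = 0; i != tokens.length; i++) {
            boolean attachToPrevious = previousMergesRight
                || operations[i] == Operation.MERGE_TO_LEFT;

            // Insert a separating space unless this is the first token
            // or the token has to be attached to its left neighbour.
            if (!(text.length() == 0 || attachToPrevious)) {
                text.append(' ');
            }
            text.append(tokens[i]);
            previousMergesRight = operations[i] == Operation.MERGE_TO_RIGHT;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
        Operation[] operations = {
            Operation.NO_OPERATION, Operation.NO_OPERATION, Operation.MERGE_TO_RIGHT,
            Operation.NO_OPERATION, Operation.NO_OPERATION, Operation.NO_OPERATION,
            Operation.NO_OPERATION, Operation.MERGE_TO_LEFT, Operation.MERGE_TO_LEFT
        };
        System.out.println(detokenize(tokens, operations));
    }
}
```
+               </pre><p>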
+               TODO: Add documentation about the dictionary format and how to 
use the API. Contributions are welcome.
+               </p>
+               <div class="section" title="Detokenizing API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.tokenizer.detokenizing.api"></a>Detokenizing 
API</h3></div></div></div>
+                       
+                       <p>TODO: Write documentation about the detokenizer api. 
Any contributions
+are very welcome. If you want to contribute please contact us on the mailing 
list
+or comment on the jira issue <a class="ulink" 
href="https://issues.apache.org/jira/browse/OPENNLP-216" 
target="_top">OPENNLP-216</a>.</p>
+               </div>
+               <div class="section" title="Detokenizer Dictionary"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.tokenizer.detokenizing.dict"></a>Detokenizer 
Dictionary</h3></div></div></div>
+                       
+                       <p>TODO: Write documentation about the detokenizer 
dictionary. Any contributions
+are very welcome. If you want to contribute please contact us on the mailing 
list
+or comment on the jira issue <a class="ulink" 
href="https://issues.apache.org/jira/browse/OPENNLP-217" 
target="_top">OPENNLP-217</a>.</p>
+               </div>
+       </div>
+</div>
+       <div class="chapter" title="Chapter&nbsp;4.&nbsp;Name Finder"><div 
class="titlepage"><div><div><h2 class="title"><a 
name="tools.namefind"></a>Chapter&nbsp;4.&nbsp;Name 
Finder</h2></div></div></div><div class="toc"><p><b>Table of 
Contents</b></p><dl><dt><span class="section"><a 
href="#tools.namefind.recognition">Named Entity 
Recognition</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.namefind.recognition.cmdline">Name Finder 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.namefind.recognition.api">Name Finder 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.namefind.training">Name Finder 
Training</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span 
class="section"><a href="#tools.namefind.training.api">Training 
API</a></span></dt><dt><span class="section"><a 
href="#tools.namefind.training.featuregen">Custom Feature 
Generation</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.namefind.eval">Evaluation</a></span></dt><dd><dl><dt><span 
class="section"><a href="#tools.namefind.eval.tool">Evaluation 
Tool</a></span></dt><dt><span class="section"><a 
href="#tools.namefind.eval.api">Evaluation 
API</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.namefind.annotation_guides">Named Entity Annotation 
Guidelines</a></span></dt></dl></div>
+
+       
+
+       <div class="section" title="Named Entity Recognition"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.namefind.recognition"></a>Named Entity 
Recognition</h2></div></div></div>
+               
+               <p>
+                       The Name Finder can detect named entities and numbers in text. To be able to
+                       detect entities the Name Finder needs a model. The model is dependent on the
+                       language and entity type it was trained for. The OpenNLP project offers a number
+                       of pre-trained name finder models which are trained on various freely available corpora.
+                       They can be downloaded at our model download page. To find names in raw text the text
+                       must be segmented into tokens and sentences. A detailed description is given in the
+                       sentence detector and tokenizer tutorial. It is important that the tokenization for
+                       the training data and the input text is identical.
+               </p>
+       
+       <div class="section" title="Name Finder Tool"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.namefind.recognition.cmdline"></a>Name Finder 
Tool</h3></div></div></div>
+               
+               <p>
+                       The easiest way to try out the Name Finder is the 
command line tool.
+                       The tool is only intended for demonstration and 
testing. Download the
+                       English
+                       person model and start the Name Finder Tool with this 
command:
+                       </p><pre class="screen">
+                               
+$ opennlp TokenNameFinder en-ner-person.bin
+                        </pre><p>
+                        
+                       The name finder now reads one tokenized sentence per line from stdin. An empty
+                       line indicates a document boundary and resets the adaptive feature generators.
+                       Just copy this text to the terminal:
+       
+                       </p><pre class="screen">
+                               
+Pierre Vinken , 61 years old , will join the board as a nonexecutive director 
Nov. 29 .
+Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
+Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC , was named
+    a director of this British industrial conglomerate .
+                        </pre><p>
+                        The name finder will now output the text with markup for person names:
+                       </p><pre class="screen">
+                               
+&lt;START:person&gt; Pierre Vinken &lt;END&gt; , 61 years old , will join the 
board as a nonexecutive director Nov. 29 .
+Mr . &lt;START:person&gt; Vinken &lt;END&gt; is chairman of Elsevier N.V. , 
the Dutch publishing group .
+&lt;START:person&gt; Rudolph Agnew &lt;END&gt; , 55 years old and former 
chairman of Consolidated Gold Fields PLC ,
+    was named a director of this British industrial conglomerate .
+                        </pre><p>
+               </p>
+       </div>
+               <div class="section" title="Name Finder API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.namefind.recognition.api"></a>Name Finder API</h3></div></div></div>
+               
+               <p>
+                       To use the Name Finder in a production system it is strongly recommended to embed it
+                       directly into the application instead of using the command line interface.
+                       First the name finder model must be loaded into memory from disk or another source.
+                       In the sample below it is loaded from disk.
+                       </p><pre class="programlisting">
+                               
+InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b 
class="hl-string"><i style="color:red">"en-ner-person.bin"</i></b>);
+
+<b class="hl-keyword">try</b> {
+  TokenNameFinderModel model = <b class="hl-keyword">new</b> 
TokenNameFinderModel(modelIn);
+}
+<b class="hl-keyword">catch</b> (IOException e) {
+  e.printStackTrace();
+}
+<b class="hl-keyword">finally</b> {
+  <b class="hl-keyword">if</b> (modelIn != null) {
+    <b class="hl-keyword">try</b> {
+      modelIn.close();
+    }
+    <b class="hl-keyword">catch</b> (IOException e) {
+    }
+  }
+}
+                        </pre><p>
+                        There are a number of reasons why the model loading can fail:
+                        </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                                       <p>Issues with the underlying I/O</p>
+                               </li><li class="listitem">
+                                       <p>The version of the model is not 
compatible with the OpenNLP version</p>
+                               </li><li class="listitem">
+                                       <p>The model is loaded into the wrong component,
+                                       for example a tokenizer model is loaded with the TokenNameFinderModel class.</p>
+                               </li><li class="listitem">
+                                       <p>The model content is not valid for 
some other reason</p>
+                               </li></ul></div><p>
+                       After the model is loaded the NameFinderME can be 
instantiated.
+                       </p><pre class="programlisting">
+                               
+NameFinderME nameFinder = <b class="hl-keyword">new</b> NameFinderME(model);
+                       </pre><p>
+                       The initialization is now finished and the Name Finder can be used. The NameFinderME
+                       class is not thread safe; it must only be called from one thread. To use multiple threads,
+                       create multiple NameFinderME instances that share the same model instance.
+                       The input text should be segmented into documents, sentences and tokens.
+                       To perform entity detection an application calls the find method for every sentence in the
+                       document. After every document clearAdaptiveData must be called to clear the adaptive data in
+                       the feature generators. Not calling clearAdaptiveData can lead to a sharp drop in the detection
+                       rate after a few documents.
+                       The following code illustrates that:
+                       </p><pre class="programlisting">
+                               
+<b class="hl-keyword">for</b> (String document[][] : documents) {
+
+  <b class="hl-keyword">for</b> (String[] sentence : document) {
+    Span nameSpans[] = nameFinder.find(sentence);
+    <i class="hl-comment" style="color: silver">// do something with the 
names</i>
+  }
+
+  nameFinder.clearAdaptiveData();
+}
+                        </pre><p>
+                        The following snippet shows a call to find:
+                        </p><pre class="programlisting">
+                               
+String sentence[] = <b class="hl-keyword">new</b> String[]{
+    <b class="hl-string"><i style="color:red">"Pierre"</i></b>,
+    <b class="hl-string"><i style="color:red">"Vinken"</i></b>,
+    <b class="hl-string"><i style="color:red">"is"</i></b>,
+    <b class="hl-string"><i style="color:red">"61"</i></b>,
+    <b class="hl-string"><i style="color:red">"years"</i></b>,
+    <b class="hl-string"><i style="color:red">"old"</i></b>,
+    <b class="hl-string"><i style="color:red">"."</i></b>
+    };
+
+Span nameSpans[] = nameFinder.find(sentence);
+                       </pre><p>
+                       The nameSpans array now contains exactly one Span, which marks the name Pierre Vinken.
+                       The elements between the begin and end offsets are the name tokens. In this case the begin
+                       offset is 0 and the end offset is 2. The Span object also knows the type of the entity.
+                       In this case it is person (defined by the model). It can be retrieved with a call to Span.getType().
+                       In addition to the statistical Name Finder, OpenNLP also offers a dictionary and a regular
+                       expression name finder implementation.
+               </p>
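+               <p>
+                       The begin and end offsets of a name Span index into the token array, not into
+                       the characters of the text. The following minimal sketch is independent of OpenNLP
+                       and assumes that the end offset is exclusive, as in the example above where begin 0
+                       and end 2 cover two tokens. The class and method names are hypothetical.
+               </p><pre class="programlisting">
```java
import java.util.Arrays;

public class NameSpanSketch {

    // Joins the tokens in the half-open token range [begin, end) with spaces.
    static String coveredTokens(String[] sentence, int begin, int end) {
        return String.join(" ", Arrays.copyOfRange(sentence, begin, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};

        // Offsets taken from the example above: begin 0, end 2.
        System.out.println(coveredTokens(sentence, 0, 2));
    }
}
```
+               </pre>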
+               <p>
+                       TODO: Explain how to retrieve probs from the name 
finder for names and for non recognized names
+               </p>
+       </div>
+       </div>
+       <div class="section" title="Name Finder Training"><div 
class="titlepage"><div><div><h2 class="title" style="clear: both"><a 
name="tools.namefind.training"></a>Name Finder Training</h2></div></div></div>
+               
+               <p>
+                       The pre-trained models might not be available for a desired language, might not detect
+                       important entities, or might not perform well enough outside the news domain.
+                       These are the typical reasons to do custom training of the name finder on a new corpus
+                       or on a corpus which is extended by private training data taken from the data which should be analyzed.
+               </p>
+               
+               <div class="section" title="Training Tool"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.namefind.training.tool"></a>Training Tool</h3></div></div></div>
+               
+               <p>
+                       OpenNLP has a command line tool which is used to train 
the models available from the model
+                       download page on various corpora.
+               </p>
+               <p>
+                       The data can be converted to the OpenNLP name finder training format, which is one
+            sentence per line. Some other formats are available as well.
+                       The sentence must be tokenized and contain spans which mark the entities. Documents are separated by
+                       empty lines which trigger the reset of the adaptive feature generators. A training file can contain
+                       multiple types. If the training file contains multiple types the created model will also be able to
+                       detect these multiple types.
+               </p>
+               <p>
+                       Sample sentence of the data:
+                       </p><pre class="screen">
+                               
+&lt;START:person&gt; Pierre Vinken &lt;END&gt; , 61 years old , will join the 
board as a nonexecutive director Nov. 29 .
+Mr . &lt;START:person&gt; Vinken &lt;END&gt; is chairman of Elsevier N.V. , 
the Dutch publishing group .
+                        </pre><p>
+                        The training data should contain at least 15000 
sentences to create a model which performs well.
+                        Usage of the tool:
+                       </p><pre class="screen">
+                               
+$ opennlp TokenNameFinderTrainer
+Usage: opennlp 
TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
 \
+[-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] 
[-factory factoryName] \
+[-resources resourcesDir] [-type modelType] [-params paramsFile] -lang 
language \
+-model modelFile -data sampleData [-encoding charsetName]
+
+Arguments description:
+        -featuregen featuregenFile
+                The feature generator descriptor file
+        -nameTypes types
+                name types to use for training
+        -sequenceCodec codec
+                sequence codec used to code name spans
+        -factory factoryName
+                A sub-class of TokenNameFinderFactory
+        -resources resourcesDir
+                The resources directory
+        -type modelType
+                The type of the token name finder model
+        -params paramsFile
+                training parameters file.
+        -lang language
+                language which is being processed.
+        -model modelFile
+                output model file.
+        -data sampleData
+                data to be used, usually a file name.
+        -encoding charsetName
+                encoding for reading and writing text, if absent the system 
default is used.
+                        </pre><p>
+                        It is now assumed that the English person name finder model should be trained from a file
+                        called en-ner-person.train which is encoded as UTF-8. The following command will train
+                        the name finder and write the model to en-ner-person.bin:
+                        </p><pre class="screen">
+                               
+$ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data 
en-ner-person.train -encoding UTF-8
+                        </pre><p>
+The example above will train models with a pre-defined feature set. It is also 
possible to use the -resources parameter to generate features based on external 
knowledge such as those based on word representation (clustering) features. The 
external resources must all be placed in a resource directory which is then 
passed as a parameter. If this option is used it is then required to pass, via 
the -featuregen parameter, an XML custom feature generator which includes some 
of the clustering features shipped with the TokenNameFinder. Currently three 
formats of clustering lexicons are accepted:
+                       </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                                       <p>Space separated two column file 
specifying the token and the cluster class as generated by toolkits such as <a 
class="ulink" href="https://code.google.com/p/word2vec/" 
target="_top">word2vec</a>.</p>
+                               </li><li class="listitem">
+                                       <p>Space separated three column file 
specifying the token, clustering class and weight, such as <a class="ulink" 
href="https://github.com/ninjin/clark_pos_induction" target="_top">Clark's 
clusters</a>.</p>
+                               </li><li class="listitem">
+                                       <p>Tab separated three column Brown 
clusters as generated by <a class="ulink" 
href="https://github.com/percyliang/brown-cluster" target="_top">
+                                               Liang's toolkit</a>.</p>
+                               </li></ul></div><p>
+                        Additionally it is possible to specify the number of iterations,
+                        the cutoff and to overwrite all types in the training data with a single type.
+                        Finally, the -sequenceCodec parameter allows specifying a BIO (Begin, Inside, Out)
+                        or BILOU (Begin, Inside, Last, Out, Unit) encoding to represent the Named Entities.
+                        An example of one such command would be as follows:
+                        </p><pre class="screen">
+                          
+$ opennlp TokenNameFinderTrainer -featuregen brown.xml -sequenceCodec BILOU 
-resources clusters/ \
+-params PerceptronTrainerParams.txt -lang en -model ner-test.bin -data 
en-train.opennlp -encoding UTF-8
+                        </pre><p>
+               </p>
+               </div>
+               <div class="section" title="Training API"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.namefind.training.api"></a>Training API</h3></div></div></div>
+               
+               <p>
+                       To train the name finder from within an application it 
is recommended to use the training
+                       API instead of the command line tool.
+                       Basically three steps are necessary to train it:
+                       </p><div class="itemizedlist"><ul class="itemizedlist" 
type="disc"><li class="listitem">
+                                       <p>The application must open a sample 
data stream</p>
+                               </li><li class="listitem">
+                                       <p>Call the NameFinderME.train 
method</p>
+                               </li><li class="listitem">
+                                       <p>Save the TokenNameFinderModel to a 
file or database</p>
+                               </li></ul></div><p>
+                       The three steps are illustrated by the following sample 
code:
+                       </p><pre class="programlisting">
+                               
+Charset charset = Charset.forName(<b class="hl-string"><i 
style="color:red">"UTF-8"</i></b>);
+ObjectStream&lt;String&gt; lineStream =
+               <b class="hl-keyword">new</b> PlainTextByLineStream(<b 
class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i 
style="color:red">"en-ner-person.train"</i></b>), charset);
+ObjectStream&lt;NameSample&gt; sampleStream = <b class="hl-keyword">new</b> 
NameSampleDataStream(lineStream);
+
+TokenNameFinderModel model;
+
+<b class="hl-keyword">try</b> {
+  model = NameFinderME.train(<b class="hl-string"><i 
style="color:red">"en"</i></b>, <b class="hl-string"><i 
style="color:red">"person"</i></b>, sampleStream, 
TrainingParameters.defaultParams(),
+            <b class="hl-keyword">new</b> TokenNameFinderFactory());
+}
+<b class="hl-keyword">finally</b> {
+  sampleStream.close();
+}
+
+OutputStream modelOut = null;
+<b class="hl-keyword">try</b> {
+  modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b 
class="hl-keyword">new</b> FileOutputStream(modelFile));
+  model.serialize(modelOut);
+} <b class="hl-keyword">finally</b> {
+  <b class="hl-keyword">if</b> (modelOut != null) 
+     modelOut.close();      
+}
+                        </pre><p>
+               </p>
+               </div>
+               
+               <div class="section" title="Custom Feature Generation"><div 
class="titlepage"><div><div><h3 class="title"><a 
name="tools.namefind.training.featuregen"></a>Custom Feature 
Generation</h3></div></div></div>
+               
+                       <p>
+                               OpenNLP defines a default feature generation which is used when no custom feature
+                               generation is specified. Users who want to experiment with the feature generation
+                               can provide a custom feature generator, either via the API or via an XML descriptor file.
+                       </p>
+                       <div class="section" title="Feature Generation defined 
by API"><div class="titlepage"><div><div><h4 class="title"><a 
name="tools.namefind.training.featuregen.api"></a>Feature Generation defined by 
API</h4></div></div></div>
+                       
+                       <p>
+                               The same custom generator must be used for training
+                               and for detecting names. If the feature 
generation at training time differs from the feature generation at detection
+                               time, the name finder might not be 
able to detect names.
+                               The following lines show how to construct a 
custom feature generator:
+                               </p><pre class="programlisting">
+                                       
+AdaptiveFeatureGenerator featureGenerator = <b class="hl-keyword">new</b> 
CachedFeatureGenerator(
+         <b class="hl-keyword">new</b> AdaptiveFeatureGenerator[]{
+           <b class="hl-keyword">new</b> WindowFeatureGenerator(<b 
class="hl-keyword">new</b> TokenFeatureGenerator(), <span 
class="hl-number">2</span>, <span class="hl-number">2</span>),
+           <b class="hl-keyword">new</b> WindowFeatureGenerator(<b 
class="hl-keyword">new</b> TokenClassFeatureGenerator(true), <span 
class="hl-number">2</span>, <span class="hl-number">2</span>),
+           <b class="hl-keyword">new</b> OutcomePriorFeatureGenerator(),
+           <b class="hl-keyword">new</b> PreviousMapFeatureGenerator(),
+           <b class="hl-keyword">new</b> BigramNameFeatureGenerator(),
+           <b class="hl-keyword">new</b> SentenceFeatureGenerator(true, false),
+           <b class="hl-keyword">new</b> 
BrownTokenFeatureGenerator(dictResource) // dictResource: a previously loaded BrownCluster
+           });
+                               </pre><p>
+                               This is similar to the default feature 
generator, but with a BrownTokenFeatureGenerator added.
+                               The javadoc of the feature generator classes 
explains what the individual feature generators do.
+                               To write a custom feature generator, 
implement the AdaptiveFeatureGenerator interface or,
+                               if it does not need to be adaptive, extend the 
FeatureGeneratorAdapter.
+                               The train method which should be used is 
defined as
+                               </p><pre class="programlisting">
+                                       
+<b class="hl-keyword">public</b> <b class="hl-keyword">static</b> 
TokenNameFinderModel train(String languageCode, String type,
+          ObjectStream&lt;NameSample&gt; samples, TrainingParameters 
trainParams,
+          TokenNameFinderFactory factory) <b class="hl-keyword">throws</b> 
IOException
+                               </pre><p>
+                               where the TokenNameFinderFactory allows you to 
specify a custom feature generator.
+                               To detect names, the model returned 
from the train method must be passed to the NameFinderME constructor:
+                               </p><pre class="programlisting">
+                                       
+<b class="hl-keyword">new</b> NameFinderME(model);
+                                </pre><p>       
+                       </p>
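+                       <p>
+                       To give an impression of what a window generator with a previous
+                       and next length of 2 contributes, the following self-contained sketch
+                       collects token features for a position and its neighbours. It uses only
+                       the JDK and is not the OpenNLP implementation: the class
+                       WindowFeatureSketch, its methods, and the exact feature strings
+                       ("w=", "p1", "n1" and so on) are assumptions made for illustration.
+                       </p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WindowFeatureSketch {

  /** Feature for a single token: its lowercased form, prefixed with "w=". */
  static String tokenFeature(String token) {
    return "w=" + token.toLowerCase(Locale.ROOT);
  }

  /**
   * Collects the token feature at index plus the features of up to
   * prevLength tokens before it and nextLength tokens after it.
   * Neighbour features are prefixed with "p" or "n" and their distance.
   */
  static List<String> windowFeatures(String[] tokens, int index,
      int prevLength, int nextLength) {
    List<String> features = new ArrayList<>();
    features.add(tokenFeature(tokens[index]));
    // Walk backwards, stopping at the start of the sentence.
    for (int i = 1; i <= prevLength && index - i >= 0; i++) {
      features.add("p" + i + tokenFeature(tokens[index - i]));
    }
    // Walk forwards, stopping at the end of the sentence.
    for (int i = 1; i <= nextLength && index + i < tokens.length; i++) {
      features.add("n" + i + tokenFeature(tokens[index + i]));
    }
    return features;
  }

  public static void main(String[] args) {
    String[] tokens = {"Mr", ".", "Pierre", "Vinken", "is", "chairman"};
    // Window of two tokens on each side of "Pierre" (index 2).
    System.out.println(windowFeatures(tokens, 2, 2, 2));
  }
}
```

+                       <p>
+                       Because a trained model only knows the feature strings it saw during
+                       training, a detector that produced differently named features would
+                       find no matching evidence, which is why the same generator has to be
+                       used in both phases.
+                       </p>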
+                       </div>
+                       <div class="section" title="Feature Generation defined 
by XML Descriptor"><div class="titlepage"><div><div><h4 class="title"><a 
name="tools.namefind.training.featuregen.xml"></a>Feature Generation defined by 
XML Descriptor</h4></div></div></div>
+                       
+                       <p>
+                       OpenNLP can also use an XML descriptor file to configure 
the feature generation. The
+            descriptor
+                       file is stored inside the model after training, and the 
feature generators are configured
+                       correctly when the name finder is instantiated.
+                       
+                       The following sample shows an XML descriptor which 
contains the default feature generator plus several types of clustering 
features:
+                               </p><pre class="programlisting">
+                                       
+<b class="hl-tag" style="color: #000096">&lt;generators&gt;</b>
+  <b class="hl-tag" style="color: #000096">&lt;cache&gt;</b> 
+    <b class="hl-tag" style="color: #000096">&lt;generators&gt;</b>
+      <b class="hl-tag" style="color: #000096">&lt;window</b> <span 
class="hl-attribute" style="color: #F5844C">prevLength</span> = <span 
class="hl-value" style="color: #993300">"2"</span> <span class="hl-attribute" 
style="color: #F5844C">nextLength</span> = <span class="hl-value" style="color: 
#993300">"2"</span><b class="hl-tag" style="color: #000096">&gt;</b>          
+        <b class="hl-tag" style="color: #000096">&lt;tokenclass/&gt;</b>
+      <b class="hl-tag" style="color: #000096">&lt;/window&gt;</b>
+      <b class="hl-tag" style="color: #000096">&lt;window</b> <span 
class="hl-attribute" style="color: #F5844C">prevLength</span> = <span 
class="hl-value" style="color: #993300">"2"</span> <span class="hl-attribute" 
style="color: #F5844C">nextLength</span> = <span class="hl-value" style="color: 
#993300">"2"</span><b class="hl-tag" style="color: #000096">&gt;</b>            
    
+

<TRUNCATED>
http://git-wip-us.apache.org/repos/asf/opennlp-site/blob/95530810/src/main/jbake/assets/android-icon-144x144.png
----------------------------------------------------------------------
diff --git a/src/main/jbake/assets/android-icon-144x144.png 
b/src/main/jbake/assets/android-icon-144x144.png
new file mode 100644
index 0000000..ee52085
Binary files /dev/null and b/src/main/jbake/assets/android-icon-144x144.png 
differ
