http://git-wip-us.apache.org/repos/asf/opennlp-site/blob/95530810/src/main/docs/1.7.2/manual/opennlp.html ---------------------------------------------------------------------- diff --git a/src/main/docs/1.7.2/manual/opennlp.html b/src/main/docs/1.7.2/manual/opennlp.html new file mode 100644 index 0000000..84dc967 --- /dev/null +++ b/src/main/docs/1.7.2/manual/opennlp.html @@ -0,0 +1,5388 @@ +<html><head> + <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> + <title>Apache OpenNLP Developer Documentation</title><link rel="stylesheet" href="css/opennlp-docs.css" type="text/css"><meta name="generator" content="DocBook XSL-NS Stylesheets V1.75.2"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div lang="en" class="book" title="Apache OpenNLP Developer Documentation"><div class="titlepage"><div><div><h1 class="title"><a name="d4e1"></a>Apache OpenNLP Developer Documentation</h1></div><div><div class="authorgroup"> + <h3 class="corpauthor">Written and maintained by the Apache OpenNLP Development + Community</h3> + </div></div><div><p class="releaseinfo"> + Version 1.7.2 + </p></div><div><p class="copyright">Copyright © 2011, 2017 The Apache Software Foundation</p></div><div><div class="legalnotice" title="Legal Notice"><a name="d4e7"></a> + <p title="License and Disclaimer"> + <b>License and Disclaimer. </b> + + The ASF licenses this documentation + to you under the Apache License, + Version 2.0 (the + "License"); you may not use this documentation + except in compliance + with the License. 
You may obtain a copy of the + License at + + </p><div class="blockquote"><blockquote class="blockquote"> + <p> + <a class="ulink" href="http://www.apache.org/licenses/LICENSE-2.0" target="_top">http://www.apache.org/licenses/LICENSE-2.0</a> + </p> + </blockquote></div><p title="License and Disclaimer"> + + Unless required by applicable law or agreed to in writing, + this documentation and its contents are distributed under the License + on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + + </p> + </div></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="chapter"><a href="#opennlp">1. Introduction</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.description">Description</a></span></dt><dt><span class="section"><a href="#intro.general.library.structure">General Library Structure</a></span></dt><dt><span class="section"><a href="#intro.api">Application Program Interface (API). Generic Example</a></span></dt><dt><span class="section"><a href="#intro.cli">Command line interface (CLI)</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.cli.description">Description</a></span></dt><dt><span class="section"><a href="#intro.cli.toolslist">List of tools</a></span></dt><dt><span class="section"><a href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a href="#intro.cli.generic">Generic Example</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.sentdetect">2. 
Sentence Detector</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.detection">Sentence Detection</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.detection.cmdline">Sentence Detection Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.detection.api">Sentence Detection API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.training">Sentence Detector Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.tokenizer">3. Tokenizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.introduction">Tokenization</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.cmdline">Tokenizer Tools</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.training">Tokenizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.detokenizing">Detokenizing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing API</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.detokenizing.dict">Detokenizer Dictionary</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.namefind">4. 
Name Finder</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition">Named Entity Recognition</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition.cmdline">Name Finder Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.recognition.api">Name Finder API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.training">Name Finder Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen">Custom Feature Generation</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.eval.tool">Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.eval.api">Evaluation API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.annotation_guides">Named Entity Annotation Guidelines</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.doccat">5. Document Categorizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.classifying">Classifying</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.classifying.cmdline">Document Categorizer Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.classifying.api">Document Categorizer API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.doccat.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.training.api">Training API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.postagger">6. 
Part-of-Speech Tagger</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.tagging">Tagging</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.tagging.cmdline">POS Tagger Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.tagging.api">POS Tagger API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.postagger.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.tagdict">Tag Dictionary</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.postagger.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.lemmatizer">7. Lemmatizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.lemmatizer.tagging.cmdline">Lemmatizer Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.tagging.api">Lemmatizer API</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training">Lemmatizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.lemmatizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.lemmatizer.evaluation">Lemmatizer Evaluation</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.chunker">8. 
Chunker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.chunking">Chunking</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.chunking.cmdline">Chunker Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.chunking.api">Chunking API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.training">Chunker Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.chunker.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.chunker.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.evaluation">Chunker Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.chunker.evaluation.tool">Chunker Evaluation Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.parser">9. Parser</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.parsing">Parsing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.parsing.cmdline">Parser Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.parsing.api">Parsing API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.parser.training">Parser Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.parser.evaluation">Parser Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.evaluation.tool">Parser Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.evaluation.api">Evaluation API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.coref">10. Coreference Resolution</a></span></dt><dt><span class="chapter"><a href="#tools.extension">11. 
Extending OpenNLP</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.extension.writing">Writing an extension</a></span></dt><dt><span class="section"><a href="#tools.extension.osgi">Running in an OSGi container</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.corpora">12. Corpora</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll">CONLL</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2000">CONLL 2000</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002">CONLL 2002</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003">CONLL 2003</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.arvores-deitadas">Arvores Deitadas</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.evaluation">Training and Evaluation</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.leipzig">Leipzig Corpora</a></span></dt><dt><span class="section"><a href="#tools.corpora.ontonotes">OntoNotes Release 4.0</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.ontonotes.namefinder">Name Finder Training</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.brat">Brat Format Support</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.brat.webtool">Sentences and Tokens</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.evaluation">Evaluation</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.cross-validation">Cross Validation</a></span></dt></dl></dd></dl></dd><dt><span 
class="chapter"><a href="#opennlp.ml">13. Machine Learning</a></span></dt><dd><dl><dt><span class="section"><a href="#opennlp.ml.maxent">Maximum Entropy</a></span></dt><dd><dl><dt><span class="section"><a href="#opennlp.ml.maxent.impl">Implementation</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#org.apche.opennlp.uima">14. UIMA Integration</a></span></dt><dd><dl><dt><span class="section"><a href="#org.apche.opennlp.running-pear-sample">Running the pear sample in CVD</a></span></dt><dt><span class="section"><a href="#org.apche.opennlp.further-help">Further Help</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.morfologik-addon">15. Morfologik Addon</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.morfologik-addon.api">Morfologik Integration</a></span></dt><dt><span class="section"><a href="#tools.morfologik-addon.cmdline">Morfologik CLI Tools</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.cli">16. The Command Line Interface</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.doccat">Doccat</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.doccat.Doccat">Doccat</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatTrainer">DoccatTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatEvaluator">DoccatEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatCrossValidator">DoccatCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatConverter">DoccatConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.dictionary">Dictionary</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.dictionary.DictionaryBuilder">DictionaryBuilder</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.tokenizer">Tokenizer</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.cli.tokenizer.SimpleTokenizer">SimpleTokenizer</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerME">TokenizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerTrainer">TokenizerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerMEEvaluator">TokenizerMEEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerCrossValidator">TokenizerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerConverter">TokenizerConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.DictionaryDetokenizer">DictionaryDetokenizer</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.sentdetect">Sentdetect</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetector">SentenceDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorTrainer">SentenceDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorEvaluator">SentenceDetectorEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorCrossValidator">SentenceDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorConverter">SentenceDetectorConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.namefind">Namefind</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinder">TokenNameFinder</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderTrainer">TokenNameFinderTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderEvaluator">TokenNameFinderEvaluator</a></span></dt><dt><span class="section"><a 
href="#tools.cli.namefind.TokenNameFinderCrossValidator">TokenNameFinderCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderConverter">TokenNameFinderConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.CensusDictionaryCreator">CensusDictionaryCreator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.postag">Postag</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.postag.POSTagger">POSTagger</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerTrainer">POSTaggerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerEvaluator">POSTaggerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerCrossValidator">POSTaggerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerConverter">POSTaggerConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.lemmatizer">Lemmatizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerME">LemmatizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerTrainerME">LemmatizerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerEvaluator">LemmatizerEvaluator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.chunker">Chunker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.chunker.ChunkerME">ChunkerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerTrainerME">ChunkerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerEvaluator">ChunkerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerCrossValidator">ChunkerCrossValidator</a></span></dt><dt><span class="section"><a 
href="#tools.cli.chunker.ChunkerConverter">ChunkerConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.parser">Parser</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.parser.Parser">Parser</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserTrainer">ParserTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserEvaluator">ParserEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserConverter">ParserConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.BuildModelUpdater">BuildModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.CheckModelUpdater">CheckModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.TaggerModelReplacer">TaggerModelReplacer</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.entitylinker">Entitylinker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.entitylinker.EntityLinker">EntityLinker</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.languagemodel">Languagemodel</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.languagemodel.LanguageModel">LanguageModel</a></span></dt></dl></dd></dl></dd></dl></div><div class="list-of-tables"><p><b>List of Tables</b></p><dl><dt>4.1. <a href="#d4e278">Generator elements</a></dt></dl></div> + + + + + <div class="chapter" title="Chapter 1. Introduction"><div class="titlepage"><div><div><h2 class="title"><a name="opennlp"></a>Chapter 1. Introduction</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#intro.description">Description</a></span></dt><dt><span class="section"><a href="#intro.general.library.structure">General Library Structure</a></span></dt><dt><span class="section"><a href="#intro.api">Application Program Interface (API). 
Generic Example</a></span></dt><dt><span class="section"><a href="#intro.cli">Command line interface (CLI)</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.cli.description">Description</a></span></dt><dt><span class="section"><a href="#intro.cli.toolslist">List of tools</a></span></dt><dt><span class="section"><a href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a href="#intro.cli.generic">Generic Example</a></span></dt></dl></dd></dl></div> + + <div class="section" title="Description"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.description"></a>Description</h2></div></div></div> + + <p> + The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. + It supports the most common NLP tasks, such as tokenization, sentence segmentation, + part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. + These tasks are usually required to build more advanced text processing services. + OpenNLP also includes maximum entropy and perceptron based machine learning. + </p> + + <p> + The goal of the OpenNLP project is to create a mature toolkit for the tasks mentioned above. + An additional goal is to provide a large number of pre-built models for a variety of languages, as + well as the annotated text resources that those models are derived from. + </p> + </div> + + <div class="section" title="General Library Structure"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.general.library.structure"></a>General Library Structure</h2></div></div></div> + + <p>The Apache OpenNLP library contains several components, enabling one to build + a full natural language processing pipeline. These components + include: sentence detector, tokenizer, + name finder, document categorizer, part-of-speech tagger, chunker, parser, and + coreference resolution. 
Components contain parts which enable one to execute the + respective natural language processing task, to train a model and often also to evaluate a + model. Each of these facilities is accessible via its application program + interface (API). In addition, a command line interface (CLI) is provided for convenience + of experiments and training. + </p> + </div> + + <div class="section" title="Application Program Interface (API). Generic Example"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.api"></a>Application Program Interface (API). Generic Example</h2></div></div></div> + + <p> + OpenNLP components have similar APIs. Normally, to execute a task, + one should provide a model and an input. + </p> + <p> + A model is usually loaded by providing a FileInputStream with a model to a + constructor of the model class: + </p><pre class="programlisting"> + +InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"lang-model-name.bin"</i></b>); + +<b class="hl-keyword">try</b> { + SomeModel model = <b class="hl-keyword">new</b> SomeModel(modelIn); +} +<b class="hl-keyword">catch</b> (IOException e) { + <i class="hl-comment" style="color: silver">//handle the exception</i> +} +<b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (null != modelIn) { + <b class="hl-keyword">try</b> { + modelIn.close(); + } + <b class="hl-keyword">catch</b> (IOException e) { + } + } +} + </pre><p> + </p> + <p> + After the model is loaded the tool itself can be instantiated. + </p><pre class="programlisting"> + +ToolName toolName = <b class="hl-keyword">new</b> ToolName(model); + </pre><p> + After the tool is instantiated, the processing task can be executed. The input and the + output formats are specific to the tool, but often the output is an array of String, + and the input is a String or an array of String. 
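The steps described so far (load a model, instantiate the tool, execute the task) can
also be sketched as one small program. This is only an illustration under stated assumptions:
SomeModel and ToolName are the same placeholders used in this section, defined here as
hypothetical stand-in classes rather than real OpenNLP types, and a ByteArrayInputStream
stands in for the FileInputStream so that the sketch runs without a model file. It uses
try-with-resources (Java 7 and later), which closes the stream automatically and makes the
explicit finally block unnecessary.
</p>

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class GenericToolSketch {

    /** Hypothetical stand-in for a model class; not a real OpenNLP type. */
    static final class SomeModel {
        final String delimiter;

        SomeModel(InputStream in) throws IOException {
            // Toy "model": the first line of the stream is the delimiter to split on.
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line = reader.readLine();
            if (line == null || line.isEmpty()) {
                throw new IOException("empty model");
            }
            delimiter = line;
        }
    }

    /** Hypothetical stand-in for a tool class; not a real OpenNLP type. */
    static final class ToolName {
        private final SomeModel model;

        ToolName(SomeModel model) {
            this.model = model;
        }

        /** String in, array of String out, as described in the text. */
        String[] executeTask(String input) {
            return input.split(Pattern.quote(model.delimiter));
        }
    }

    /** Loads the model from the stream, instantiates the tool, and runs the task. */
    static String[] run(InputStream modelIn, String text) throws IOException {
        SomeModel model;
        // try-with-resources closes the stream even if loading the model fails.
        try (InputStream in = modelIn) {
            model = new SomeModel(in);
        }
        ToolName toolName = new ToolName(model);
        return toolName.executeTask(text);
    }

    public static void main(String[] args) throws IOException {
        // In real use the stream would be: new FileInputStream("lang-model-name.bin")
        InputStream modelIn =
            new ByteArrayInputStream(" ".getBytes(StandardCharsets.UTF_8));
        for (String part : run(modelIn, "This is a sample text.")) {
            System.out.println(part);
        }
    }
}
```

<p>
With the toy model above, the sketch prints the five space-separated pieces of the sample
sentence, one per line. A real tool would be constructed from a real model class in the same
way. In the placeholder notation of this section, the final step reads: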
+ </p><pre class="programlisting"> + +String[] output = toolName.executeTask(<b class="hl-string"><i style="color:red">"This is a sample text."</i></b>); + </pre><p> + </p> + </div> + + <div class="section" title="Command line interface (CLI)"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.cli"></a>Command line interface (CLI)</h2></div></div></div> + + <div class="section" title="Description"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.description"></a>Description</h3></div></div></div> + + <p> + OpenNLP provides a command line script that serves as a single entry point to all + included tools. The script is located in the bin directory of the OpenNLP binary + distribution. Two versions are included: opennlp.bat for Windows and opennlp for Linux or + compatible systems. + </p> + </div> + + <div class="section" title="List of tools"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.toolslist"></a>List of tools</h3></div></div></div> + + <p> + The list of command line tools for Apache OpenNLP 1.7.2, + as well as a description of their arguments, is available in <a class="xref" href="#tools.cli" title="Chapter 16. The Command Line Interface">Chapter 16, <i>The Command Line Interface</i></a>. + </p> + </div> + + <div class="section" title="Setting up"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.setup"></a>Setting up</h3></div></div></div> + + <p> + The OpenNLP script uses the JAVA_CMD and JAVA_HOME variables to determine which command to + use to start the Java virtual machine. + </p> + <p> + The OpenNLP script uses the OPENNLP_HOME variable to determine the location of the binary + distribution of OpenNLP. It is recommended to point this variable to the binary + distribution of the current OpenNLP version and to update the PATH variable to include + $OPENNLP_HOME/bin or %OPENNLP_HOME%\bin. + </p> + <p> + This configuration allows calling OpenNLP conveniently. 
The examples below + assume this configuration is in place. + </p> + </div> + + <div class="section" title="Generic Example"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.generic"></a>Generic Example</h3></div></div></div> + + + <p> + Apache OpenNLP provides a common command line script to access all its tools: + </p><pre class="screen"> + +$ opennlp + </pre><p> + This script prints the current version of the library and lists all available tools: + </p><pre class="screen"> + +OpenNLP <VERSION>. Usage: opennlp TOOL +where TOOL is one of: + Doccat learnable document categorizer + DoccatTrainer trainer for the learnable document categorizer + DoccatConverter converts leipzig data format to native OpenNLP format + DictionaryBuilder builds a new dictionary + SimpleTokenizer character class tokenizer + TokenizerME learnable tokenizer + TokenizerTrainer trainer for the learnable tokenizer + TokenizerMEEvaluator evaluator for the learnable tokenizer + TokenizerCrossValidator K-fold cross validator for the learnable tokenizer + TokenizerConverter converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format + DictionaryDetokenizer + SentenceDetector learnable sentence detector + SentenceDetectorTrainer trainer for the learnable sentence detector + SentenceDetectorEvaluator evaluator for the learnable sentence detector + SentenceDetectorCrossValidator K-fold cross validator for the learnable sentence detector + SentenceDetectorConverter converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format + TokenNameFinder learnable name finder + TokenNameFinderTrainer trainer for the learnable name finder + TokenNameFinderEvaluator Measures the performance of the NameFinder model with the reference data + TokenNameFinderCrossValidator K-fold cross validator for the learnable Name Finder + TokenNameFinderConverter converts foreign data formats (bionlp2004,conll03,conll02,ad) to native OpenNLP format + CensusDictionaryCreator 
Converts 1990 US Census names into a dictionary + POSTagger learnable part of speech tagger + POSTaggerTrainer trains a model for the part-of-speech tagger + POSTaggerEvaluator Measures the performance of the POS tagger model with the reference data + POSTaggerCrossValidator K-fold cross validator for the learnable POS tagger + POSTaggerConverter converts conllx data format to native OpenNLP format + ChunkerME learnable chunker + ChunkerTrainerME trainer for the learnable chunker + ChunkerEvaluator Measures the performance of the Chunker model with the reference data + ChunkerCrossValidator K-fold cross validator for the chunker + ChunkerConverter converts ad data format to native OpenNLP format + Parser performs full syntactic parsing + ParserTrainer trains the learnable parser + ParserEvaluator Measures the performance of the Parser model with the reference data + BuildModelUpdater trains and updates the build model in a parser model + CheckModelUpdater trains and updates the check model in a parser model + TaggerModelReplacer replaces the tagger model in a parser model +All tools print help when invoked with help parameter +Example: opennlp SimpleTokenizer help + + </pre><p> + </p> + <p>OpenNLP tools share a similar command line structure and options. To discover a tool's + options, run it with no parameters: + </p><pre class="screen"> + +$ opennlp ToolName + </pre><p> + The tool will output two blocks of help. + </p> + <p> + The first block describes the general structure of the tool's command line: + </p><pre class="screen"> + +Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ... -model modelFile ... + </pre><p> + The command line consists of the mandatory tool name + (TokenizerTrainer), the optional format parameters ([.namefinder|.conllx|.pos]), + the optional parameters ([-abbDict path] ...), and the mandatory parameters + (-model modelFile ...). 
+ </p> + <p> + The format parameters enable direct processing of non-native data without conversion. + Each format may have its own parameters, which are displayed when the tool is + executed without parameters or with the help parameter: + </p><pre class="screen"> + +$ opennlp TokenizerTrainer.conllx help + </pre><p> + </p><pre class="screen"> + +Usage: opennlp TokenizerTrainer.conllx [-abbDict path] [-alphaNumOpt isAlphaNumOpt] ... + +Arguments description: + -abbDict path + abbreviation dictionary in XML format. + ... + </pre><p> + To switch the tool to a specific format, add a dot and the format name after + the tool name: + </p><pre class="screen"> + +$ opennlp TokenizerTrainer.conllx -model en-pos.bin ... + </pre><p> + </p> + <p> + The second block of the help message describes the individual arguments: + </p><pre class="screen"> + +Arguments description: + -type maxent|perceptron|perceptron_sequence + The type of the token name finder model. One of maxent|perceptron|perceptron_sequence. + -dict dictionaryPath + The XML tag dictionary file + ... + </pre><p> + </p> + <p> + Most processing tools require at least a model: + </p><pre class="screen"> + +$ opennlp ToolName lang-model-name.bin + </pre><p> + When a tool is executed this way, the model is loaded and the tool waits for + input on standard input. The input is processed and the result is printed to standard + output. + </p> + <p>The alternative, and most commonly used, way is to use console input and + output redirection to supply an input file and an output file: + </p><pre class="screen"> + +$ opennlp ToolName lang-model-name.bin < input.txt > output.txt + </pre><p> + </p> + <p> + Most model training tools take a model name first, + then optional training options (such as model type or number of iterations), + and then the data. + </p> + <p> + A model name is just a file name. 
+ </p> + <p> + Training options often include the number of iterations, the cutoff, + an abbreviation dictionary, and so on. Sometimes it is possible to provide these + options via a training options file. In this case the command line options are ignored and the + ones from the file are used. + </p> + <p> + For the data one has to specify the location of the data (filename) and often + the language and encoding. + </p> + <p> + A generic example of a command line to launch a tool trainer might be: + </p><pre class="screen"> + +$ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train -encoding UTF-8 + </pre><p> + or with a format: + </p><pre class="screen"> + +$ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data input.train \ + -types per -encoding UTF-8 + </pre><p> + </p> + <p>Most tools for model evaluation are similar to those for task execution, and + need to be given a model name first, then optionally some evaluation options (such + as whether to print misclassified samples), and then the test data. A generic + example of a command line to launch an evaluation tool might be: + </p><pre class="screen"> + +$ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -encoding UTF-8 + </pre><p> + </p> + </div> + </div> + +</div> + <div class="chapter" title="Chapter 2. Sentence Detector"><div class="titlepage"><div><div><h2 class="title"><a name="tools.sentdetect"></a>Chapter 2. 
Sentence Detector</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.sentdetect.detection">Sentence Detection</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.detection.cmdline">Sentence Detection Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.detection.api">Sentence Detection API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.training">Sentence Detector Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.eval">Evaluation</a></span></dt> <dd><dl><dt><span class="section"><a href="#tools.sentdetect.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></div> + + + + <div class="section" title="Sentence Detection"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.sentdetect.detection"></a>Sentence Detection</h2></div></div></div> + + <p> + The OpenNLP Sentence Detector can detect that a punctuation character + marks the end of a sentence or not. In this sense a sentence is defined + as the longest white space trimmed character sequence between two punctuation + marks. The first and last sentence make an exception to this rule. The first + non whitespace character is assumed to be the begin of a sentence, and the + last non whitespace character is assumed to be a sentence end. + The sample text below should be segmented into its sentences. + </p><pre class="screen"> + +Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is +chairman of Elsevier N.V., the Dutch publishing group. 
Rudolph Agnew, 55 years +old and former chairman of Consolidated Gold Fields PLC, was named a director of this +British industrial conglomerate. + </pre><p> + After detecting the sentence boundaries each sentence is written in its own line. + </p><pre class="screen"> + +Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. +Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. +Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, + was named a director of this British industrial conglomerate. + </pre><p> + Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the web site are trained, + but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text. + The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence. + Most components in OpenNLP expect input which is segmented into sentences. + </p> + + <div class="section" title="Sentence Detection Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.detection.cmdline"></a>Sentence Detection Tool</h3></div></div></div> + + <p> + The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing. + Download the english sentence detector model and start the Sentence Detector Tool with this command: + </p><pre class="screen"> + +$ opennlp SentenceDetector en-sent.bin + </pre><p> + Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console. + Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command. 
+ </p><pre class="screen"> + +$ opennlp SentenceDetector en-sent.bin < input.txt > output.txt + </pre><p> + For the english sentence model from the website the input text should not be tokenized. + </p> + </div> + <div class="section" title="Sentence Detection API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.detection.api"></a>Sentence Detection API</h3></div></div></div> + + <p> + The Sentence Detector can be easily integrated into an application via its API. + To instantiate the Sentence Detector the sentence model must be loaded first. + </p><pre class="programlisting"> + +InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-sent.bin"</i></b>); + +<b class="hl-keyword">try</b> { + SentenceModel model = <b class="hl-keyword">new</b> SentenceModel(modelIn); +} +<b class="hl-keyword">catch</b> (IOException e) { + e.printStackTrace(); +} +<b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (modelIn != null) { + <b class="hl-keyword">try</b> { + modelIn.close(); + } + <b class="hl-keyword">catch</b> (IOException e) { + } + } +} + </pre><p> + After the model is loaded the SentenceDetectorME can be instantiated. + </p><pre class="programlisting"> + +SentenceDetectorME sentenceDetector = <b class="hl-keyword">new</b> SentenceDetectorME(model); + </pre><p> + The Sentence Detector can output an array of Strings, where each String is one sentence. + </p><pre class="programlisting"> + +String sentences[] = sentenceDetector.sentDetect(<b class="hl-string"><i style="color:red">" First sentence. Second sentence. "</i></b>); + </pre><p> + The result array now contains two entries. The first String is "First sentence." and the + second String is "Second sentence." The whitespace before, between and after the input String is removed. + The API also offers a method which simply returns the span of the sentence in the input string. 
+ </p><pre class="programlisting"> + +Span sentences[] = sentenceDetector.sentPosDetect(<b class="hl-string"><i style="color:red">"  First sentence. Second sentence. "</i></b>); + </pre><p> + The result array again contains two entries. The first span begins at index 2 and ends at + 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span. + </p> + </div> + </div> + <div class="section" title="Sentence Detector Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.sentdetect.training"></a>Sentence Detector Training</h2></div></div></div> + + <p></p> + <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.training.tool"></a>Training Tool</h3></div></div></div> + + <p> + OpenNLP has a command line tool which is used to train the models available from the model + download page on various corpora. The data must be converted to the OpenNLP Sentence Detector + training format, which is one sentence per line, exactly like the output in the sample above. + An empty line indicates a document boundary. + In case the document boundary is unknown, it is recommended to have an empty line every few tens of + sentences. + Usage of the tool: + </p><pre class="screen"> + +$ opennlp SentenceDetectorTrainer +Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \ + [-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \ + -lang language -data sampleData [-encoding charsetName] + +Arguments description: + -abbDict path + abbreviation dictionary in XML format. + -params paramsFile + training parameters file. + -iterations num + number of training iterations, ignored if -params is used. + -cutoff num + minimal number of times a feature must be seen, ignored if -params is used. + -model modelFile + output model file. 
+ -lang language + language which is being processed. + -data sampleData + data to be used, usually a file name. + -encoding charsetName + encoding for reading and writing text, if absent the system default is used. + </pre><p> + To train an English sentence detector use the following command: + </p><pre class="screen"> + +$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8 + + </pre><p> + It should produce the following output: + </p><pre class="screen"> + +Indexing events using cutoff of 5 + + Computing event counts... done. 4883 events + Indexing... done. +Sorting and merging events... done. Reduced 4883 events to 2945. +Done indexing. +Incorporating indexed data for training... +done. + Number of Event Tokens: 2945 + Number of Outcomes: 2 + Number of Predicates: 467 +...done. +Computing model parameters... +Performing 100 iterations. + 1: .. loglikelihood=-3384.6376826743144 0.38951464263772273 + 2: .. loglikelihood=-2191.9266688597672 0.9397911120212984 + 3: .. loglikelihood=-1645.8640771555981 0.9643661683391358 + 4: .. loglikelihood=-1340.386303774519 0.9739913987302887 + 5: .. loglikelihood=-1148.4141548519624 0.9748105672742167 + + ...<skipping a bunch of iterations>... + + 95: .. loglikelihood=-288.25556805874436 0.9834118369854598 + 96: .. loglikelihood=-287.2283680343481 0.9834118369854598 + 97: .. loglikelihood=-286.2174830344526 0.9834118369854598 + 98: .. loglikelihood=-285.222486981048 0.9834118369854598 + 99: .. loglikelihood=-284.24296917223916 0.9834118369854598 +100: .. loglikelihood=-283.2785335773966 0.9834118369854598 +Wrote sentence detector model. +Path: en-sent.bin + + </pre><p> + </p> + </div> + <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.training.api"></a>Training API</h3></div></div></div> + + <p> + The Sentence Detector also offers an API to train a new sentence detection model. 
+ Basically three steps are necessary to train it: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>The application must open a sample data stream</p> + </li><li class="listitem"> + <p>Call the SentenceDetectorME.train method</p> + </li><li class="listitem"> + <p>Save the SentenceModel to a file or directly use it</p> + </li></ul></div><p> + The following sample code illustrates these steps: + </p><pre class="programlisting"> + +Charset charset = Charset.forName(<b class="hl-string"><i style="color:red">"UTF-8"</i></b>); +ObjectStream<String> lineStream = + <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-sent.train"</i></b>), charset); +ObjectStream<SentenceSample> sampleStream = <b class="hl-keyword">new</b> SentenceSampleStream(lineStream); + +SentenceModel model; + +<b class="hl-keyword">try</b> { + model = SentenceDetectorME.train(<b class="hl-string"><i style="color:red">"en"</i></b>, sampleStream, true, null, TrainingParameters.defaultParams()); +} +<b class="hl-keyword">finally</b> { + sampleStream.close(); +} + +OutputStream modelOut = null; +<b class="hl-keyword">try</b> { + modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile)); + model.serialize(modelOut); +} <b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (modelOut != null) + modelOut.close(); +} + </pre><p> + </p> + </div> + </div> + <div class="section" title="Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.sentdetect.eval"></a>Evaluation</h2></div></div></div> + + <p> + </p> + <div class="section" title="Evaluation Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.eval.tool"></a>Evaluation Tool</h3></div></div></div> + + <p> + The command shows how the evaluator tool can be run: + </p><pre 
class="screen"> + +$ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8 + +Loading model ... done +Evaluating ... done + +Precision: 0.9465737514518002 +Recall: 0.9095982142857143 +F-Measure: 0.9277177006260672 + </pre><p> + The en-sent.eval file has the same format as the training data. + </p> + </div> + </div> +</div> + <div class="chapter" title="Chapter 3. Tokenizer"><div class="titlepage"><div><div><h2 class="title"><a name="tools.tokenizer"></a>Chapter 3. Tokenizer</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.tokenizer.introduction">Tokenization</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.cmdline">Tokenizer Tools</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.training">Tokenizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.detokenizing">Detokenizing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing API</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.detokenizing.dict">Detokenizer Dictionary</a></span></dt></dl></dd></dl></div> + + + + <div class="section" title="Tokenization"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.tokenizer.introduction"></a>Tokenization</h2></div></div></div> + + <p> + The OpenNLP Tokenizers segment an input character sequence into + tokens. Tokens are usually + words, punctuation, numbers, etc. + + </p><pre class="screen"> + +Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. +Mr. 
Vinken is chairman of Elsevier N.V., the Dutch publishing group. +Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields + PLC, was named a director of this British industrial conglomerate. + + </pre><p> + + The following result shows the individual tokens in a whitespace + separated representation. + + </p><pre class="screen"> + +Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . +Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group . +Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , + was named a nonexecutive director of this British industrial conglomerate . +A form of asbestos once used to make Kent cigarette filters has caused a high + percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , + researchers reported . + + </pre><p> + + OpenNLP offers multiple tokenizer implementations: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>Whitespace Tokenizer - A whitespace tokenizer, non whitespace + sequences are identified as tokens</p> + </li><li class="listitem"> + <p>Simple Tokenizer - A character class tokenizer, sequences of + the same character class are tokens</p> + </li><li class="listitem"> + <p>Learnable Tokenizer - A maximum entropy tokenizer, detects + token boundaries based on probability model</p> + </li></ul></div><p> + + Most part-of-speech taggers, parsers and so on, work with text + tokenized in this manner. It is important to ensure that your + tokenizer + produces tokens of the type expected by your later text + processing + components. + </p> + + <p> + With OpenNLP (as with many systems), tokenization is a two-stage + process: + first, sentence boundaries are identified, then tokens within + each + sentence are identified. 
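As a rough illustration of the difference between the Whitespace Tokenizer and the Simple Tokenizer listed above, the following sketch tokenizes a string both ways using only the Java standard library. It does not use or reproduce the OpenNLP classes: the class name TokenizerSketch and its three character classes (letter, digit, other) are assumptions made for this example, and the real SimpleTokenizer may group characters differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of OpenNLP.
class TokenizerSketch {

    // Whitespace tokenization: every run of non-whitespace characters is one token.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumed character classes: 0 = whitespace, 1 = letter, 2 = digit, 3 = other.
    static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Character class tokenization: a token ends wherever the class changes.
    static List<String> charClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;
        int previous = 0;
        for (int i = 0; i < text.length(); i++) {
            int current = charClass(text.charAt(i));
            if (current != previous) {
                if (previous != 0) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
            previous = current;
        }
        if (previous != 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "director Nov. 29.";
        System.out.println(whitespaceTokenize(sample));
        System.out.println(charClassTokenize(sample));
    }
}
```

In this sketch the whitespace variant keeps "Nov." together, while the character class variant separates the period from the letters. A learnable tokenizer exists precisely because neither fixed rule is right in every case.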
+ </p> + + <div class="section" title="Tokenizer Tools"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.cmdline"></a>Tokenizer Tools</h3></div></div></div> + + <p>The easiest way to try out the tokenizers is to use the command line + tools. The tools are only intended for demonstration and testing. + </p> + <p>There are two tools, one for the Simple Tokenizer and one for + the learnable tokenizer. A command line tool for the Whitespace + Tokenizer does not exist, because the whitespace separated output + would be identical to the input.</p> + <p> + The following command shows how to use the Simple Tokenizer Tool. + + </p><pre class="screen"> + +$ opennlp SimpleTokenizer + </pre><p> + To use the learnable tokenizer, download the English token model from + our website. + </p><pre class="screen"> + +$ opennlp TokenizerME en-token.bin + </pre><p> + To test the tokenizer, copy the sample from above to the console. The + whitespace separated tokens will be written back to the + console. + </p> + <p> + Usually the input is read from a file and written to a file. + </p><pre class="screen"> + +$ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt + </pre><p> + It can be done in the same way for the Simple Tokenizer. + </p> + <p> + Since most text comes truly raw and doesn't have sentence boundaries + and such, it's possible to create a pipe which first performs sentence + boundary detection and then tokenization. The following sample illustrates + that. + </p><pre class="screen"> + +$ opennlp SentenceDetector sentdetect.model < article.txt | opennlp TokenizerME tokenize.model | more +Loading model ... Loading model ... done +done +Showa Shell gained 20 to 1,570 and Mitsubishi Oil rose 50 to 1,500. +Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 . +Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 . +Marubeni advanced 11 to 890 . 
+London share prices were bolstered largely by continued gains on Wall Street and technical + factors affecting demand for London 's blue-chip stocks . +...etc... + </pre><p> + Of course this is all on the command line. Many people use the models + directly in their Java code by creating SentenceDetector and + Tokenizer objects and calling their methods as appropriate. The + following section will explain how the Tokenizers can be used + directly from java. + </p> + </div> + + <div class="section" title="Tokenizer API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.api"></a>Tokenizer API</h3></div></div></div> + + <p> + The Tokenizers can be integrated into an application by the defined + API. + The shared instance of the WhitespaceTokenizer can be retrieved from a + static field WhitespaceTokenizer.INSTANCE. The shared instance of the + SimpleTokenizer can be retrieved in the same way from + SimpleTokenizer.INSTANCE. + To instantiate the TokenizerME (the learnable tokenizer) a Token Model + must be created first. The following code sample shows how a model + can be loaded. + </p><pre class="programlisting"> + +InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-token.bin"</i></b>); + +<b class="hl-keyword">try</b> { + TokenizerModel model = <b class="hl-keyword">new</b> TokenizerModel(modelIn); +} +<b class="hl-keyword">catch</b> (IOException e) { + e.printStackTrace(); +} +<b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (modelIn != null) { + <b class="hl-keyword">try</b> { + modelIn.close(); + } + <b class="hl-keyword">catch</b> (IOException e) { + } + } +} + </pre><p> + After the model is loaded the TokenizerME can be instantiated. 
+ </p><pre class="programlisting"> + +Tokenizer tokenizer = <b class="hl-keyword">new</b> TokenizerME(model); + </pre><p> + The tokenizer offers two tokenize methods, both of which expect an input + String object which contains the untokenized text. If possible it + should be a sentence, but depending on the training of the learnable + tokenizer this is not required. The first returns an array of + Strings, where each String is one token. + </p><pre class="programlisting"> + +String tokens[] = tokenizer.tokenize(<b class="hl-string"><i style="color:red">"An input sample sentence."</i></b>); + </pre><p> + The output will be an array with these tokens. + </p><pre class="programlisting"> + +"An", "input", "sample", "sentence", "." + </pre><p> + The second method, tokenizePos, returns an array of Spans; each Span + contains the begin and end character offsets of the token in the input + String. + </p><pre class="programlisting"> + +Span tokenSpans[] = tokenizer.tokenizePos(<b class="hl-string"><i style="color:red">"An input sample sentence."</i></b>); + </pre><p> + The tokenSpans array now contains 5 elements. To get the text for one + span, call Span.getCoveredText which takes a span and the input text. + + The TokenizerME is able to output the probabilities for the detected + tokens. The getTokenProbabilities method must be called directly + after one of the tokenize methods was called. + </p><pre class="programlisting"> + +TokenizerME tokenizer = ... + +String tokens[] = tokenizer.tokenize(...); +<b class="hl-keyword">double</b> tokenProbs[] = tokenizer.getTokenProbabilities(); + </pre><p> + The tokenProbs array now contains one double value per token; the + value is between 0 and 1, where 1 is the highest possible probability + and 0 the lowest possible probability. 
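To make the begin and end offsets returned by a method like tokenizePos more concrete, the sketch below computes spans for whitespace separated tokens in plain Java and recovers each token with substring. The SimpleSpan class is a stand-in invented for this example, not OpenNLP's Span, and the sketch assumes that the end offset is exclusive, as the sentence detector example suggests (a span from 2 to 17 covers 15 characters). Because it only splits on whitespace, it yields four spans for the sample sentence rather than the five produced by the learnable tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of OpenNLP.
class SpanSketch {

    // Minimal stand-in for a span: begin offset inclusive, end offset exclusive (assumed).
    static final class SimpleSpan {
        final int start;
        final int end;

        SimpleSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Returns the part of the text covered by this span.
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    // Computes one span per run of non-whitespace characters.
    static List<SimpleSpan> whitespaceSpans(String text) {
        List<SimpleSpan> spans = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new SimpleSpan(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            spans.add(new SimpleSpan(start, text.length()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "An input sample sentence.";
        for (SimpleSpan span : whitespaceSpans(text)) {
            System.out.println(span.start + ".." + span.end + " " + span.coveredText(text));
        }
    }
}
```

Working with offsets instead of copied strings keeps every token tied to its position in the original input, which is useful when later components need to annotate the source text.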
+ </p> + </div> + </div> + + <div class="section" title="Tokenizer Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.tokenizer.training"></a>Tokenizer Training</h2></div></div></div> + + + <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.training.tool"></a>Training Tool</h3></div></div></div> + + <p> + OpenNLP has a command line tool which is used to train the models + available from the model download page on various corpora. The data + can be converted to the OpenNLP Tokenizer training format or used directly. + The OpenNLP format contains one sentence per line. Tokens are either separated by a + whitespace or by a special <SPLIT> tag. + + The following sample shows the sample from above in the correct format. + </p><pre class="screen"> + +Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>. +Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>. +Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, + was named a nonexecutive director of this British industrial conglomerate<SPLIT>. + </pre><p> + Usage of the tool: + </p><pre class="screen"> + +$ opennlp TokenizerTrainer +Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] \ + [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] [-iterations num] \ + [-cutoff num] -model modelFile -lang language -data sampleData \ + [-encoding charsetName] + +Arguments description: + -abbDict path + abbreviation dictionary in XML format. + -alphaNumOpt isAlphaNumOpt + Optimization flag to skip alpha numeric tokens for further tokenization + -params paramsFile + training parameters file. + -iterations num + number of training iterations, ignored if -params is used. + -cutoff num + minimal number of times a feature must be seen, ignored if -params is used. 
+ -model modelFile + output model file. + -lang language + language which is being processed. + -data sampleData + data to be used, usually a file name. + -encoding charsetName + encoding for reading and writing text, if absent the system default is used. + </pre><p> + To train the english tokenizer use the following command: + </p><pre class="screen"> + +$ opennlp TokenizerTrainer -model en-token.bin -alphaNumOpt -lang en -data en-token.train -encoding UTF-8 + +Indexing events using cutoff of 5 + + Computing event counts... done. 262271 events + Indexing... done. +Sorting and merging events... done. Reduced 262271 events to 59060. +Done indexing. +Incorporating indexed data for training... +done. + Number of Event Tokens: 59060 + Number of Outcomes: 2 + Number of Predicates: 15695 +...done. +Computing model parameters... +Performing 100 iterations. + 1: .. loglikelihood=-181792.40419263614 0.9614292087192255 + 2: .. loglikelihood=-34208.094253153664 0.9629238459456059 + 3: .. loglikelihood=-18784.123872910015 0.9729211388220581 + 4: .. loglikelihood=-13246.88162585859 0.9856103038460219 + 5: .. loglikelihood=-10209.262670265718 0.9894422181636552 + + ...<skipping a bunch of iterations>... + + 95: .. loglikelihood=-769.2107474529454 0.999511955191386 + 96: .. loglikelihood=-763.8891914534009 0.999511955191386 + 97: .. loglikelihood=-758.6685383254891 0.9995157680414533 + 98: .. loglikelihood=-753.5458314695236 0.9995157680414533 + 99: .. loglikelihood=-748.5182305519613 0.9995157680414533 +100: .. loglikelihood=-743.5830058068038 0.9995157680414533 +Wrote tokenizer model. +Path: en-token.bin + </pre><p> + </p> + </div> + <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.training.api"></a>Training API</h3></div></div></div> + + <p> + The Tokenizer offers an API to train a new tokenization model. 
Basically three steps + are necessary to train it: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>The application must open a sample data stream</p> + </li><li class="listitem"> + <p>Call the TokenizerME.train method</p> + </li><li class="listitem"> + <p>Save the TokenizerModel to a file or directly use it</p> + </li></ul></div><p> + The following sample code illustrates these steps: + </p><pre class="programlisting"> + +Charset charset = Charset.forName(<b class="hl-string"><i style="color:red">"UTF-8"</i></b>); +ObjectStream<String> lineStream = <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-token.train"</i></b>), + charset); +ObjectStream<TokenSample> sampleStream = <b class="hl-keyword">new</b> TokenSampleStream(lineStream); + +TokenizerModel model; + +<b class="hl-keyword">try</b> { + model = TokenizerME.train(<b class="hl-string"><i style="color:red">"en"</i></b>, sampleStream, true, TrainingParameters.defaultParams()); +} +<b class="hl-keyword">finally</b> { + sampleStream.close(); +} + +OutputStream modelOut = null; +<b class="hl-keyword">try</b> { + modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile)); + model.serialize(modelOut); +} <b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (modelOut != null) + modelOut.close(); +} + </pre><p> + </p> + </div> + </div> + + <div class="section" title="Detokenizing"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.tokenizer.detokenizing"></a>Detokenizing</h2></div></div></div> + + <p> + Detokenizing is simply the opposite of tokenization: the original non-tokenized string is + reconstructed from a token sequence. The OpenNLP implementation was created to undo the tokenization + of training data for the tokenizer. 
It can also be used to undo the tokenization of such a trained + tokenizer. The implementation is strictly rule based and defines how tokens should be attached + to a sentence-wise character sequence. + </p> + <p> + The rule dictionary assigns to every token an operation which describes how it should be attached + to one continuous character sequence. + </p> + <p> + The following rules can be assigned to a token: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>MERGE_TO_LEFT - Merges the token to the left side.</p> + </li><li class="listitem"> + <p>MERGE_TO_RIGHT - Merges the token to the right side.</p> + </li><li class="listitem"> + <p>RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence + and to the left side on second occurrence.</p> + </li></ul></div><p> + + The following sample illustrates how the detokenizer works with a small + rule dictionary (illustration format, not the xml data format): + </p><pre class="programlisting"> + +. MERGE_TO_LEFT +" RIGHT_LEFT_MATCHING + </pre><p> + The dictionary should be used to de-tokenize the following whitespace tokenized sentence: + </p><pre class="programlisting"> + +He said " This is a test " . + </pre><p> + The tokens would get these tags based on the dictionary: + </p><pre class="programlisting"> + +He -> NO_OPERATION +said -> NO_OPERATION +" -> MERGE_TO_RIGHT +This -> NO_OPERATION +is -> NO_OPERATION +a -> NO_OPERATION +test -> NO_OPERATION +" -> MERGE_TO_LEFT +. -> MERGE_TO_LEFT + </pre><p> + That will result in the following character sequence: + </p><pre class="programlisting"> + +He said "This is a test". + </pre><p> + TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome. 
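The tagging and merging steps described above can be sketched in plain Java as follows. This is not the OpenNLP detokenizer or its API: the class DetokenizerSketch, its enums and its methods are hypothetical names made up for this example, and the rule dictionary is a simple in-memory map rather than the XML dictionary format. The sketch assumes that a space is written between two tokens unless the left one is tagged MERGE_TO_RIGHT or the right one is tagged MERGE_TO_LEFT.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of OpenNLP.
class DetokenizerSketch {

    // Rules that the dictionary can assign to a token.
    enum Rule { MERGE_TO_LEFT, MERGE_TO_RIGHT, RIGHT_LEFT_MATCHING }

    // Operations that a token receives after tagging.
    enum Operation { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT }

    // Tags every token with an operation based on the rule dictionary.
    static Operation[] tag(String[] tokens, Map<String, Rule> dictionary) {
        Operation[] operations = new Operation[tokens.length];
        Set<String> opened = new HashSet<>(); // matching tokens seen an odd number of times
        for (int i = 0; i < tokens.length; i++) {
            Rule rule = dictionary.get(tokens[i]);
            if (rule == null) {
                operations[i] = Operation.NO_OPERATION;
            } else if (rule == Rule.MERGE_TO_LEFT) {
                operations[i] = Operation.MERGE_TO_LEFT;
            } else if (rule == Rule.MERGE_TO_RIGHT) {
                operations[i] = Operation.MERGE_TO_RIGHT;
            } else if (opened.remove(tokens[i])) {
                // Second occurrence of a matching token closes the pair.
                operations[i] = Operation.MERGE_TO_LEFT;
            } else {
                // First occurrence of a matching token opens the pair.
                opened.add(tokens[i]);
                operations[i] = Operation.MERGE_TO_RIGHT;
            }
        }
        return operations;
    }

    // Joins the tokens, leaving out the space wherever a merge operation applies.
    static String detokenize(String[] tokens, Map<String, Rule> dictionary) {
        Operation[] operations = tag(tokens, dictionary);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            result.append(tokens[i]);
            boolean last = i == tokens.length - 1;
            if (!last
                    && operations[i] != Operation.MERGE_TO_RIGHT
                    && operations[i + 1] != Operation.MERGE_TO_LEFT) {
                result.append(' ');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, Rule> dictionary = new HashMap<>();
        dictionary.put(".", Rule.MERGE_TO_LEFT);
        dictionary.put("\"", Rule.RIGHT_LEFT_MATCHING);
        String[] tokens = "He said \" This is a test \" .".split(" ");
        System.out.println(detokenize(tokens, dictionary));
    }
}
```

Separating the tagging step from the joining step mirrors the description above: the dictionary decides the operation per token, and the string is assembled only from those operations.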
+ </p> + <div class="section" title="Detokenizing API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.detokenizing.api"></a>Detokenizing API</h3></div></div></div> + + <p>TODO: Write documentation about the detokenizer api. Any contributions +are very welcome. If you want to contribute please contact us on the mailing list +or comment on the jira issue <a class="ulink" href="https://issues.apache.org/jira/browse/OPENNLP-216" target="_top">OPENNLP-216</a>.</p> + </div> + <div class="section" title="Detokenizer Dictionary"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.detokenizing.dict"></a>Detokenizer Dictionary</h3></div></div></div> + + <p>TODO: Write documentation about the detokenizer dictionary. Any contributions +are very welcome. If you want to contribute please contact us on the mailing list +or comment on the jira issue <a class="ulink" href="https://issues.apache.org/jira/browse/OPENNLP-217" target="_top">OPENNLP-217</a>.</p> + </div> + </div> +</div> + <div class="chapter" title="Chapter 4. Name Finder"><div class="titlepage"><div><div><h2 class="title"><a name="tools.namefind"></a>Chapter 4. 
Name Finder</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.namefind.recognition">Named Entity Recognition</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition.cmdline">Name Finder Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.recognition.api">Name Finder API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.training">Name Finder Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen">Custom Feature Generation</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.eval.tool">Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.eval.api">Evaluation API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.annotation_guides">Named Entity Annotation Guidelines</a></span></dt></dl></div> + + + + <div class="section" title="Named Entity Recognition"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.namefind.recognition"></a>Named Entity Recognition</h2></div></div></div> + + <p> + The Name Finder can detect named entities and numbers in text. To be able to + detect entities the Name Finder needs a model. The model is dependent on the + language and entity type it was trained for. The OpenNLP project offers a number + of pre-trained name finder models which are trained on various freely available corpora. + They can be downloaded from our model download page. To find names in raw text the text + must be segmented into tokens and sentences. 
A detailed description is given in the + sentence detector and tokenizer tutorial. It is important that the tokenization of + the training data and of the input text is identical. + </p> + + <div class="section" title="Name Finder Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.recognition.cmdline"></a>Name Finder Tool</h3></div></div></div> + + <p> + The easiest way to try out the Name Finder is the command line tool. + The tool is only intended for demonstration and testing. Download the + English + person model and start the Name Finder Tool with this command: + </p><pre class="screen"> + +$ opennlp TokenNameFinder en-ner-person.bin + </pre><p> + + The name finder now reads one tokenized sentence per line from stdin. An empty + line indicates a document boundary and resets the adaptive feature generators. + Just copy this text to the terminal: + + </p><pre class="screen"> + +Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . +Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group . +Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named + a director of this British industrial conglomerate . + </pre><p> + The name finder will now output the text with markup for person names: + </p><pre class="screen"> + +<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . +Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . +<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , + was named a director of this British industrial conglomerate . 
+ </pre><p> + </p> + </div> + <div class="section" title="Name Finder API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.recognition.api"></a>Name Finder API</h3></div></div></div> + + <p> + To use the Name Finder in a production system it is strongly recommended to embed it + directly into the application instead of using the command line interface. + First the name finder model must be loaded into memory from disk or another source. + In the sample below it is loaded from disk. + </p><pre class="programlisting"> + +InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-ner-person.bin"</i></b>); +<i class="hl-comment" style="color: silver">// declared outside the try block so that it stays in scope after loading</i> +TokenNameFinderModel model = null; + +<b class="hl-keyword">try</b> { + model = <b class="hl-keyword">new</b> TokenNameFinderModel(modelIn); +} +<b class="hl-keyword">catch</b> (IOException e) { + e.printStackTrace(); +} +<b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (modelIn != null) { + <b class="hl-keyword">try</b> { + modelIn.close(); + } + <b class="hl-keyword">catch</b> (IOException e) { + <i class="hl-comment" style="color: silver">// nothing to do if closing fails</i> + } + } +} + </pre><p> + There are a number of reasons why the model loading can fail: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>Issues with the underlying I/O</p> + </li><li class="listitem"> + <p>The version of the model is not compatible with the OpenNLP version</p> + </li><li class="listitem"> + <p>The model is loaded into the wrong component, + for example a tokenizer model is loaded with the TokenNameFinderModel class.</p> + </li><li class="listitem"> + <p>The model content is not valid for some other reason</p> + </li></ul></div><p> + After the model is loaded the NameFinderME can be instantiated. + </p><pre class="programlisting"> + +NameFinderME nameFinder = <b class="hl-keyword">new</b> NameFinderME(model); + </pre><p> + The initialization is now finished and the Name Finder can be used. 
The NameFinderME + class is not thread safe; it must only be called from one thread. To use multiple threads, + multiple NameFinderME instances sharing the same model instance can be created. + The input text should be segmented into documents, sentences and tokens. + To perform entity detection an application calls the find method for every sentence in the + document. After every document clearAdaptiveData must be called to clear the adaptive data in + the feature generators. Not calling clearAdaptiveData can lead to a sharp drop in the detection + rate after a few documents. + The following code illustrates that: + </p><pre class="programlisting"> + +<b class="hl-keyword">for</b> (String document[][] : documents) { + + <b class="hl-keyword">for</b> (String[] sentence : document) { + Span nameSpans[] = nameFinder.find(sentence); + <i class="hl-comment" style="color: silver">// do something with the names</i> + } + + nameFinder.clearAdaptiveData(); +} + </pre><p> + The following snippet shows a call to find: + </p><pre class="programlisting"> + +String sentence[] = <b class="hl-keyword">new</b> String[]{ + <b class="hl-string"><i style="color:red">"Pierre"</i></b>, + <b class="hl-string"><i style="color:red">"Vinken"</i></b>, + <b class="hl-string"><i style="color:red">"is"</i></b>, + <b class="hl-string"><i style="color:red">"61"</i></b>, + <b class="hl-string"><i style="color:red">"years"</i></b>, + <b class="hl-string"><i style="color:red">"old"</i></b>, + <b class="hl-string"><i style="color:red">"."</i></b> + }; + +Span nameSpans[] = nameFinder.find(sentence); + </pre><p> + The nameSpans array now contains exactly one Span which marks the name Pierre Vinken. + The elements between the begin and end offsets are the name tokens. In this case the begin + offset is 0 and the end offset is 2 (the end offset is exclusive). The Span object also knows the type of the entity. + In this case it is person (defined by the model). It can be retrieved with a call to Span.getType(). 
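+ </p>
+ <p>
+ As a minimal sketch of how these offsets map back to tokens, the following plain Java program
+ copies the tokens between the begin offset and the exclusive end offset. The class name
+ SpanTokensSketch and the hard-coded offsets are assumptions for illustration only; with a real
+ Span the two values would come from getStart() and getEnd().
+ </p><pre class="programlisting">
+
+import java.util.Arrays;
+
+public class SpanTokensSketch {
+
+  public static void main(String[] args) {
+    String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
+
+    // Assumed offsets taken from the example above; with a real Span they
+    // would be read via span.getStart() and span.getEnd().
+    int begin = 0;
+    int end = 2; // exclusive
+
+    // Copy the tokens covered by the span and join them into one string.
+    String[] nameTokens = Arrays.copyOfRange(sentence, begin, end);
+    System.out.println(String.join(" ", nameTokens)); // prints "Pierre Vinken"
+  }
+}
+ </pre><p>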
+ In addition to the statistical Name Finder, OpenNLP also offers a dictionary and a regular + expression name finder implementation. + </p> + <p> + TODO: Explain how to retrieve probs from the name finder for names and for non recognized names + </p> + </div> + </div> + <div class="section" title="Name Finder Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.namefind.training"></a>Name Finder Training</h2></div></div></div> + + <p> + The pre-trained models might not be available for a desired language, might not detect + important entities, or might not perform well enough outside the news domain. + These are the typical reasons to do custom training of the name finder on a new corpus + or on a corpus which is extended by private training data taken from the data which should be analyzed. + </p> + + <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.training.tool"></a>Training Tool</h3></div></div></div> + + <p> + OpenNLP has a command line tool which is used to train the models available from the model + download page on various corpora. + </p> + <p> + The data can be converted to the OpenNLP name finder training format, which is one + sentence per line. Some other formats are available as well. + The sentence must be tokenized and contain spans which mark the entities. Documents are separated by + empty lines which trigger the reset of the adaptive feature generators. A training file can contain + multiple types. If the training file contains multiple types the created model will also be able to + detect these multiple types. + </p> + <p> + Sample sentence of the data: + </p><pre class="screen"> + +<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . +Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . 
+ </pre><p> + The training data should contain at least 15000 sentences to create a model which performs well. + Usage of the tool: + </p><pre class="screen"> + +$ opennlp TokenNameFinderTrainer +Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] \ +[-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-factory factoryName] \ +[-resources resourcesDir] [-type modelType] [-params paramsFile] -lang language \ +-model modelFile -data sampleData [-encoding charsetName] + +Arguments description: + -featuregen featuregenFile + The feature generator descriptor file + -nameTypes types + name types to use for training + -sequenceCodec codec + sequence codec used to code name spans + -factory factoryName + A sub-class of TokenNameFinderFactory + -resources resourcesDir + The resources directory + -type modelType + The type of the token name finder model + -params paramsFile + training parameters file. + -lang language + language which is being processed. + -model modelFile + output model file. + -data sampleData + data to be used, usually a file name. + -encoding charsetName + encoding for reading and writing text, if absent the system default is used. + </pre><p> + Assume that the English person name finder model should be trained from a file + called en-ner-person.train which is encoded as UTF-8. The following command will train + the name finder and write the model to en-ner-person.bin: + </p><pre class="screen"> + +$ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data en-ner-person.train -encoding UTF-8 + </pre><p> +The example above will train models with a pre-defined feature set. It is also possible to use the -resources parameter to generate features based on external knowledge such as those based on word representation (clustering) features. The external resources must all be placed in a resource directory which is then passed as a parameter. 
If this option is used it is then required to pass, via the -featuregen parameter, an XML custom feature generator which includes some of the clustering features shipped with the TokenNameFinder. Currently three formats of clustering lexicons are accepted: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>Space separated two column file specifying the token and the cluster class as generated by toolkits such as <a class="ulink" href="https://code.google.com/p/word2vec/" target="_top">word2vec</a>.</p> + </li><li class="listitem"> + <p>Space separated three column file specifying the token, clustering class and weight, such as <a class="ulink" href="https://github.com/ninjin/clark_pos_induction" target="_top">Clark's clusters</a>.</p> + </li><li class="listitem"> + <p>Tab separated three column Brown clusters as generated by <a class="ulink" href="https://github.com/percyliang/brown-cluster" target="_top"> + Liang's toolkit</a>.</p> + </li></ul></div><p> + Additionally it is possible to specify the number of iterations, + the cutoff and to overwrite all types in the training data with a single type. Finally, the -sequenceCodec parameter allows specifying a BIO (Begin, Inside, Out) or BILOU (Begin, Inside, Last, Out, Unit) encoding to represent the Named Entities. An example of one such command would be as follows: + </p><pre class="screen"> + +$ opennlp TokenNameFinderTrainer -featuregen brown.xml -sequenceCodec BILOU -resources clusters/ \ +-params PerceptronTrainerParams.txt -lang en -model ner-test.bin -data en-train.opennlp -encoding UTF-8 + </pre><p> + </p> + </div> + <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.training.api"></a>Training API</h3></div></div></div> + + <p> + To train the name finder from within an application it is recommended to use the training + API instead of the command line tool. 
+ Basically three steps are necessary to train it: + </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> + <p>The application must open a sample data stream</p> + </li><li class="listitem"> + <p>Call the NameFinderME.train method</p> + </li><li class="listitem"> + <p>Save the TokenNameFinderModel to a file or database</p> + </li></ul></div><p> + The three steps are illustrated by the following sample code: + </p><pre class="programlisting"> + +Charset charset = Charset.forName(<b class="hl-string"><i style="color:red">"UTF-8"</i></b>); +ObjectStream<String> lineStream = + <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-ner-person.train"</i></b>), charset); +ObjectStream<NameSample> sampleStream = <b class="hl-keyword">new</b> NameSampleDataStream(lineStream); + +TokenNameFinderModel model; + +<b class="hl-keyword">try</b> { + model = NameFinderME.train(<b class="hl-string"><i style="color:red">"en"</i></b>, <b class="hl-string"><i style="color:red">"person"</i></b>, sampleStream, TrainingParameters.defaultParams(), + <b class="hl-keyword">new</b> TokenNameFinderFactory()); +} +<b class="hl-keyword">finally</b> { + sampleStream.close(); +} + +<i class="hl-comment" style="color: silver">// modelFile is the file the trained model is written to</i> +OutputStream modelOut = null; + +<b class="hl-keyword">try</b> { + modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile)); + model.serialize(modelOut); +} <b class="hl-keyword">finally</b> { + <b class="hl-keyword">if</b> (modelOut != null) + modelOut.close(); +} + </pre><p> + </p> + </div> + + <div class="section" title="Custom Feature Generation"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.training.featuregen"></a>Custom Feature Generation</h3></div></div></div> + + <p> + OpenNLP defines a default feature generation which is used when no custom feature + generation is specified. 
Users who want to experiment with the feature generation + can provide a custom feature generator, either via the API or via an XML descriptor file. + </p> + <div class="section" title="Feature Generation defined by API"><div class="titlepage"><div><div><h4 class="title"><a name="tools.namefind.training.featuregen.api"></a>Feature Generation defined by API</h4></div></div></div> + + <p> + The custom generator must be used for training + and for detecting the names. If the feature generation during training time and detection + time is different the name finder might not be able to detect names. + The following lines show how to construct a custom feature generator: + </p><pre class="programlisting"> + +<i class="hl-comment" style="color: silver">// dictResource is a previously loaded BrownCluster</i> +AdaptiveFeatureGenerator featureGenerator = <b class="hl-keyword">new</b> CachedFeatureGenerator( + <b class="hl-keyword">new</b> AdaptiveFeatureGenerator[]{ + <b class="hl-keyword">new</b> WindowFeatureGenerator(<b class="hl-keyword">new</b> TokenFeatureGenerator(), <span class="hl-number">2</span>, <span class="hl-number">2</span>), + <b class="hl-keyword">new</b> WindowFeatureGenerator(<b class="hl-keyword">new</b> TokenClassFeatureGenerator(true), <span class="hl-number">2</span>, <span class="hl-number">2</span>), + <b class="hl-keyword">new</b> OutcomePriorFeatureGenerator(), + <b class="hl-keyword">new</b> PreviousMapFeatureGenerator(), + <b class="hl-keyword">new</b> BigramNameFeatureGenerator(), + <b class="hl-keyword">new</b> SentenceFeatureGenerator(true, false), + <b class="hl-keyword">new</b> BrownTokenFeatureGenerator(dictResource) + }); + </pre><p> + This is similar to the default feature generator but with a BrownTokenFeatureGenerator added. + The javadoc of the feature generator classes explains what the individual feature generators do. + To write a custom feature generator please implement the AdaptiveFeatureGenerator interface or, + if it does not need to be adaptive, extend the FeatureGeneratorAdapter. 
+ The train method which should be used is defined as + </p><pre class="programlisting"> + +<b class="hl-keyword">public</b> <b class="hl-keyword">static</b> TokenNameFinderModel train(String languageCode, String type, + ObjectStream<NameSample> samples, TrainingParameters trainParams, + TokenNameFinderFactory factory) <b class="hl-keyword">throws</b> IOException + </pre><p> + where the TokenNameFinderFactory allows a custom feature generator to be specified. + To detect names the model which was returned from the train method must be passed to the NameFinderME constructor. + </p><pre class="programlisting"> + +<b class="hl-keyword">new</b> NameFinderME(model); + </pre><p> + </p> + </div> + <div class="section" title="Feature Generation defined by XML Descriptor"><div class="titlepage"><div><div><h4 class="title"><a name="tools.namefind.training.featuregen.xml"></a>Feature Generation defined by XML Descriptor</h4></div></div></div> + + <p> + OpenNLP can also use an XML descriptor file to configure the feature generation. The + descriptor + file is stored inside the model after training and the feature generators are configured + correctly when the name finder is instantiated. 
+ + The following sample shows an XML descriptor which contains the default feature generator plus several types of clustering features: + </p><pre class="programlisting"> + +<b class="hl-tag" style="color: #000096"><generators></b> + <b class="hl-tag" style="color: #000096"><cache></b> + <b class="hl-tag" style="color: #000096"><generators></b> + <b class="hl-tag" style="color: #000096"><window</b> <span class="hl-attribute" style="color: #F5844C">prevLength</span> = <span class="hl-value" style="color: #993300">"2"</span> <span class="hl-attribute" style="color: #F5844C">nextLength</span> = <span class="hl-value" style="color: #993300">"2"</span><b class="hl-tag" style="color: #000096">></b> + <b class="hl-tag" style="color: #000096"><tokenclass/></b> + <b class="hl-tag" style="color: #000096"></window></b> + <b class="hl-tag" style="color: #000096"><window</b> <span class="hl-attribute" style="color: #F5844C">prevLength</span> = <span class="hl-value" style="color: #993300">"2"</span> <span class="hl-attribute" style="color: #F5844C">nextLength</span> = <span class="hl-value" style="color: #993300">"2"</span><b class="hl-tag" style="color: #000096">></b> +
<TRUNCATED> http://git-wip-us.apache.org/repos/asf/opennlp-site/blob/95530810/src/main/jbake/assets/android-icon-144x144.png ---------------------------------------------------------------------- diff --git a/src/main/jbake/assets/android-icon-144x144.png b/src/main/jbake/assets/android-icon-144x144.png new file mode 100644 index 0000000..ee52085 Binary files /dev/null and b/src/main/jbake/assets/android-icon-144x144.png differ