Author: ragerri
Date: Thu Sep  3 15:12:04 2015
New Revision: 1701045

URL: http://svn.apache.org/r1701045
Log:
OPENNLP-811 update namefinder documentation

Modified:
    opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml

Modified: opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml
URL: http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?rev=1701045&r1=1701044&r2=1701045&view=diff
==============================================================================
--- opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml (original)
+++ opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml Thu Sep  3 15:12:04 2015
@@ -188,8 +188,7 @@ Span nameSpans[] = nameFinder.find(sente
                        The sentence must be tokenized and contain spans which 
mark the entities. Documents are separated by
                        empty lines which trigger the reset of the adaptive 
feature generators. A training file can contain
                        multiple types. If the training file contains multiple 
types the created model will also be able to
-                       detect these multiple types. For now it is recommended 
to only train single type models, since multi
-                       type support is still experimental.
+                       detect these multiple types.
                </para>
                <para>
                        Sample sentence of the data:
@@ -203,28 +202,27 @@ Mr . <START:person> Vinken <END> is chai
                        <screen>
                                <![CDATA[
 $ opennlp TokenNameFinderTrainer
-Usage: opennlp TokenNameFinderTrainer[.bionlp2004|.conll03|.conll02|.ad] 
[-resources resourcesDir] \
-               [-type modelType] [-featuregen featuregenFile] [-params 
paramsFile] \
-               [-iterations num] [-cutoff num] -model modelFile -lang language 
\
-               -data sampleData [-encoding charsetName]
+Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-factory factoryName] [-resources resourcesDir] [-type modelType] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]
 
 Arguments description:
+        -featuregen featuregenFile
+                The feature generator descriptor file
+        -nameTypes types
+                name types to use for training
+        -sequenceCodec codec
+                sequence codec used to code name spans
+        -factory factoryName
+                A sub-class of TokenNameFinderFactory
         -resources resourcesDir
                 The resources directory
         -type modelType
                 The type of the token name finder model
-        -featuregen featuregenFile
-                The feature generator descriptor file
         -params paramsFile
                 training parameters file.
-        -iterations num
-                number of training iterations, ignored if -params is used.
-        -cutoff num
-                minimal number of times a feature must be seen, ignored if 
-params is used.
-        -model modelFile
-                output model file.
         -lang language
                 language which is being processed.
+        -model modelFile
+                output model file.
         -data sampleData
                 data to be used, usually a file name.
         -encoding charsetName
@@ -237,8 +235,25 @@ Arguments description:
                                <![CDATA[
 $ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data 
en-ner-person.train -encoding UTF-8]]>
                         </screen>
+The example above will train models with a pre-defined feature set. It is also 
possible to use the -resources parameter to generate features based on external 
knowledge such as those based on word representation (clustering) features. The 
external resources must all be placed in a resource directory which is then 
passed as a parameter. If this option is used it is then required to pass, via 
the -featuregen parameter, a XML custom feature generator which includes some 
of the clustering features shipped with the TokenNameFinder. Currently three 
formats of clustering lexicons are accepted:
+                       <itemizedlist>
+                               <listitem>
+                                       <para>Space separated two column file 
specifying the token and the cluster class as generated by toolkits such as 
<ulink url="https://code.google.com/p/word2vec/">word2vec</ulink>.</para>
+                               </listitem>
+                               <listitem>
+                                       <para>Space separated three column file 
specifying the token, clustering class and weight, such as <ulink 
url="https://github.com/ninjin/clark_pos_induction">Clark's 
clusters</ulink>.</para>
+                               </listitem>
+                               <listitem>
+                                       <para>Tab separated three column Brown 
clusters as generated by <ulink  
url="https://github.com/percyliang/brown-cluster">
+                                               Liang's toolkit</ulink>.</para>
+                               </listitem>
+                       </itemizedlist>
                         Additionally it is possible to specify the number of 
iterations,
-                        the cutoff and to overwrite all types in the training 
data with a single type.
+                        the cutoff and to overwrite all types in the training 
data with a single type. Finally, the -sequenceCodec parameter selects either a 
BIO (Begin, Inside, Outside) or a BILOU (Begin, Inside, Last, Outside, Unit) 
encoding to represent the Named Entities. An example of one such command would 
be as follows:
+                        <screen>
+                          <![CDATA[
+$ opennlp TokenNameFinderTrainer -featuregen brown.xml -sequenceCodec BILOU -resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang en -model ner-test.bin -data en-train.opennlp -encoding UTF-8]]>
+                        </screen>
                </para>
                </section>
                <section id="tools.namefind.training.api">
@@ -270,7 +285,7 @@ TokenNameFinderModel model;
 
 try {
   model = NameFinderME.train("en", "person", sampleStream, 
TrainingParameters.defaultParams(),
-            null, Collections.<String, Object>emptyMap());
+            TokenNameFinderFactory nameFinderFactory);
 }
 finally {
   sampleStream.close();
@@ -310,25 +325,26 @@ AdaptiveFeatureGenerator featureGenerato
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
-           new SentenceFeatureGenerator(true, false)
+           new SentenceFeatureGenerator(true, false),
+           new BrownTokenFeatureGenerator(BrownCluster dictResource)
            });]]>
                                </programlisting>
-                               which is similar to the default feature 
generator.
+                               which is similar to the default feature 
generator but with a BrownTokenFeature added.
                                The javadoc of the feature generator classes 
explain what the individual feature generators do.
                                To write a custom feature generator please 
implement the AdaptiveFeatureGenerator interface or
                                if it must not be adaptive extend the 
FeatureGeneratorAdapter.
                                The train method which should be used is 
defined as
                                <programlisting language="java">
                                        <![CDATA[
-public static TokenNameFinderModel train(String languageCode, String type, 
ObjectStream<NameSample> samples, 
-       TrainingParameters trainParams, AdaptiveFeatureGenerator generator, 
final Map<String, Object> resources) throws IOException]]>
+public static TokenNameFinderModel train(String languageCode, String type,
+          ObjectStream<NameSample> samples, TrainingParameters trainParams,
+          TokenNameFinderFactory factory) throws IOException]]>
                                </programlisting>
-                               and can take feature generator as an argument.
-                               To detect names the model which was returned 
from the train method and the
-                               feature generator must be passed to the 
NameFinderME constructor.
+                               where the TokenNameFinderFactory allows to 
specify a custom feature generator.
+                               To detect names the model which was returned 
from the train method must be passed to the NameFinderME constructor.
                                <programlisting language="java">
                                        <![CDATA[
-new NameFinderME(model, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE);]]>
+new NameFinderME(model);]]>
                                 </programlisting>       
                        </para>
                        </section>
@@ -340,7 +356,7 @@ new NameFinderME(model, featureGenerator
                        file is stored inside the model after training and the 
feature generators are configured
                        correctly when the name finder is instantiated.
                        
-                       The following sample shows a xml descriptor:
+                       The following sample shows a xml descriptor which 
contains the default feature generator plus several types of clustering 
features:
                                <programlisting language="xml">
                                        <![CDATA[
 <generators>
@@ -356,6 +372,13 @@ new NameFinderME(model, featureGenerator
       <prevmap/>
       <bigram/>
       <sentence begin="true" end="false"/>
+      <window prevLength = "2" nextLength = "2">
+        <brownclustertoken dict="brownCluster" />
+      </window>
+      <brownclustertokenclass dict="brownCluster" />
+      <brownclusterbigram dict="brownCluster" />
+      <wordcluster dict="word2vec.cluster" />
+      <wordcluster dict="clark.cluster" />
     </generators>
   </cache> 
 </generators>]]>
@@ -435,6 +458,26 @@ new NameFinderME(model, featureGenerator
                                        <entry>none</entry>
                              </row>
                              <row>
+                               <entry>wordcluster</entry>
+                               <entry>no</entry>
+                               <entry><emphasis>dict</emphasis> is the key of 
the clustering resource to use</entry>
+                             </row>
+                             <row>
+                               <entry>brownclustertoken</entry>
+                               <entry>no</entry>
+                               <entry><emphasis>dict</emphasis> is the key of 
the clustering resource to use</entry>
+                               </row>
+                               <row>
+                               <entry>brownclustertokenclass</entry>
+                               <entry>no</entry>
+                               <entry><emphasis>dict</emphasis> is the key of 
the clustering resource to use</entry>
+                             </row>
+                             <row>
+                               <entry>brownclusterbigram</entry>
+                               <entry>no</entry>
+                               <entry><emphasis>dict</emphasis> is the key of 
the clustering resource to use</entry>
+                             </row>
+                             <row>
                                        <entry>window</entry>
                                        <entry>yes</entry>
                                        <entry><emphasis>prevLength</emphasis> 
and <emphasis>nextLength</emphasis> must be integers and specify the window 
size</entry>
@@ -552,4 +595,4 @@ System.out.println(result.toString());]]
                        </itemizedlist>
                </para>
                </section>
-</chapter>
\ No newline at end of file
+</chapter>
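
Note on the -sequenceCodec option documented in this change: the difference
between the BIO and BILOU encodings can be illustrated with a small,
self-contained sketch. It does not use the OpenNLP API. The class name
SequenceCodecSketch, its two methods and the B-/I-/L-/U-/O label strings are
hypothetical and follow the common textbook notation; the outcome names that
OpenNLP stores in a trained model may differ.

```java
import java.util.Arrays;

// Illustrative only: tags one entity span over a token sequence.
// Spans are half-open, [start, end), as in OpenNLP's Span class.
public class SequenceCodecSketch {

    /** BIO: first token of the span is B-, every following token is I-. */
    public static String[] encodeBio(int tokenCount, int start, int end, String type) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "O"); // tokens outside any entity
        for (int i = start; i < end; i++) {
            tags[i] = (i == start ? "B-" : "I-") + type;
        }
        return tags;
    }

    /** BILOU: adds L- for the last token and U- for single-token entities. */
    public static String[] encodeBilou(int tokenCount, int start, int end, String type) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "O");
        if (end - start == 1) {
            tags[start] = "U-" + type; // unit-length entity
            return tags;
        }
        for (int i = start; i < end; i++) {
            String prefix = (i == start) ? "B-" : (i == end - 1) ? "L-" : "I-";
            tags[i] = prefix + type;
        }
        return tags;
    }

    public static void main(String[] args) {
        // Tokens: Mr . Pierre Vinken is chairman  (entity = tokens 2..3)
        System.out.println(String.join(" ", encodeBio(6, 2, 4, "person")));
        System.out.println(String.join(" ", encodeBilou(6, 2, 4, "person")));
    }
}
```

BILOU gives the learner explicit boundary and single-token outcomes at the
cost of a larger outcome set; which codec works better is an empirical
question for a given corpus.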

