Author: ragerri
Date: Thu Sep 3 15:12:04 2015
New Revision: 1701045
URL: http://svn.apache.org/r1701045
Log:
OPENNLP-811 update namefinder documentation
Modified:
opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml
Modified: opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml
URL:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?rev=1701045&r1=1701044&r2=1701045&view=diff
==============================================================================
--- opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml (original)
+++ opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml Thu Sep 3 15:12:04 2015
@@ -188,8 +188,7 @@ Span nameSpans[] = nameFinder.find(sente
The sentence must be tokenized and contain spans which
mark the entities. Documents are separated by
empty lines which trigger the reset of the adaptive
feature generators. A training file can contain
multiple types. If the training file contains multiple
types the created model will also be able to
- detect these multiple types. For now it is recommended
to only train single type models, since multi
- type support is still experimental.
+ detect these multiple types.
</para>
<para>
Sample sentence of the data:
@@ -203,28 +202,27 @@ Mr . <START:person> Vinken <END> is chai
<screen>
<![CDATA[
$ opennlp TokenNameFinderTrainer
-Usage: opennlp TokenNameFinderTrainer[.bionlp2004|.conll03|.conll02|.ad]
[-resources resourcesDir] \
- [-type modelType] [-featuregen featuregenFile] [-params
paramsFile] \
- [-iterations num] [-cutoff num] -model modelFile -lang language
\
- -data sampleData [-encoding charsetName]
+Usage: opennlp
TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
[-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec]
[-factory factoryName] [-resources resourcesDir] [-type modelType] [-params
paramsFile] -lang language -model modelFile -data sampleData [-encoding
charsetName]
Arguments description:
+ -featuregen featuregenFile
+ The feature generator descriptor file
+ -nameTypes types
+ name types to use for training
+ -sequenceCodec codec
+ sequence codec used to code name spans
+ -factory factoryName
+ A sub-class of TokenNameFinderFactory
-resources resourcesDir
The resources directory
-type modelType
The type of the token name finder model
- -featuregen featuregenFile
- The feature generator descriptor file
-params paramsFile
training parameters file.
- -iterations num
- number of training iterations, ignored if -params is used.
- -cutoff num
- minimal number of times a feature must be seen, ignored if
-params is used.
- -model modelFile
- output model file.
-lang language
language which is being processed.
+ -model modelFile
+ output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -237,8 +235,25 @@ Arguments description:
<![CDATA[
$ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data
en-ner-person.train -encoding UTF-8]]>
</screen>
+The example above trains a model with a pre-defined feature set. It is also
possible to use the -resources parameter to generate features based on external
knowledge, such as word representation (clustering) features. All external
resources must be placed in a resource directory, which is then passed as a
parameter. If this option is used, a custom XML feature generator descriptor
that includes some of the clustering features shipped with the TokenNameFinder
must also be passed via the -featuregen parameter. Currently three
formats of clustering lexicons are accepted:
+ <itemizedlist>
+ <listitem>
+ <para>Space separated two column file
specifying the token and the cluster class as generated by toolkits such as
<ulink url="https://code.google.com/p/word2vec/">word2vec</ulink>.</para>
+ </listitem>
+ <listitem>
+ <para>Space separated three column file
specifying the token, clustering class and weight, such as <ulink
url="https://github.com/ninjin/clark_pos_induction">Clark's
clusters</ulink>.</para>
+ </listitem>
+ <listitem>
+ <para>Tab separated three column Brown
clusters as generated by <ulink
url="https://github.com/percyliang/brown-cluster">
+ Liang's toolkit</ulink>.</para>
+ </listitem>
+ </itemizedlist>
Additionally it is possible to specify the number of
iterations,
- the cutoff and to overwrite all types in the training
data with a single type.
+ the cutoff and to overwrite all types in the training
data with a single type. Finally, the -sequenceCodec parameter allows
specifying a BIO (Begin, Inside, Outside) or BILOU (Begin, Inside, Last, Outside, Unit)
encoding to represent the named entities. An example of one such command is
as follows:
+ <screen>
+ <![CDATA[
+$ opennlp TokenNameFinderTrainer -featuregen brown.xml -sequenceCodec BILOU
-resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang en
-model ner-test.bin -data en-train.opennlp -encoding UTF-8]]>
+ </screen>
</para>
</section>
<section id="tools.namefind.training.api">
@@ -270,7 +285,7 @@ TokenNameFinderModel model;
try {
model = NameFinderME.train("en", "person", sampleStream,
TrainingParameters.defaultParams(),
- null, Collections.<String, Object>emptyMap());
+ nameFinderFactory); // a TokenNameFinderFactory instance
}
finally {
sampleStream.close();
@@ -310,25 +325,26 @@ AdaptiveFeatureGenerator featureGenerato
new OutcomePriorFeatureGenerator(),
new PreviousMapFeatureGenerator(),
new BigramNameFeatureGenerator(),
- new SentenceFeatureGenerator(true, false)
+ new SentenceFeatureGenerator(true, false),
+ new BrownTokenFeatureGenerator(dictResource) // dictResource is a BrownCluster
});]]>
</programlisting>
- which is similar to the default feature
generator.
+ which is similar to the default feature
generator but with a BrownTokenFeatureGenerator added.
The javadoc of the feature generator classes
explains what the individual feature generators do.
To write a custom feature generator please
implement the AdaptiveFeatureGenerator interface or,
if it does not need to be adaptive, extend the
FeatureGeneratorAdapter.
The train method which should be used is
defined as
<programlisting language="java">
<![CDATA[
-public static TokenNameFinderModel train(String languageCode, String type,
ObjectStream<NameSample> samples,
- TrainingParameters trainParams, AdaptiveFeatureGenerator generator,
final Map<String, Object> resources) throws IOException]]>
+public static TokenNameFinderModel train(String languageCode, String type,
+ ObjectStream<NameSample> samples, TrainingParameters trainParams,
+ TokenNameFinderFactory factory) throws IOException]]>
</programlisting>
- and can take feature generator as an argument.
- To detect names the model which was returned
from the train method and the
- feature generator must be passed to the
NameFinderME constructor.
+ where the TokenNameFinderFactory allows a custom
feature generator to be specified.
+ To detect names, the model which was returned
from the train method must be passed to the NameFinderME constructor.
<programlisting language="java">
<![CDATA[
-new NameFinderME(model, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE);]]>
+new NameFinderME(model);]]>
</programlisting>
</para>
</section>
@@ -340,7 +356,7 @@ new NameFinderME(model, featureGenerator
file is stored inside the model after training and the
feature generators are configured
correctly when the name finder is instantiated.
- The following sample shows a xml descriptor:
+ The following sample shows an XML descriptor which
contains the default feature generator plus several types of clustering
features:
<programlisting language="xml">
<![CDATA[
<generators>
@@ -356,6 +372,13 @@ new NameFinderME(model, featureGenerator
<prevmap/>
<bigram/>
<sentence begin="true" end="false"/>
+ <window prevLength = "2" nextLength = "2">
+ <brownclustertoken dict="brownCluster" />
+ </window>
+ <brownclustertokenclass dict="brownCluster" />
+ <brownclusterbigram dict="brownCluster" />
+ <wordcluster dict="word2vec.cluster" />
+ <wordcluster dict="clark.cluster" />
</generators>
</cache>
</generators>]]>
@@ -435,6 +458,26 @@ new NameFinderME(model, featureGenerator
<entry>none</entry>
</row>
<row>
+ <entry>wordcluster</entry>
+ <entry>no</entry>
+ <entry><emphasis>dict</emphasis> is the key of
the clustering resource to use</entry>
+ </row>
+ <row>
+ <entry>brownclustertoken</entry>
+ <entry>no</entry>
+ <entry><emphasis>dict</emphasis> is the key of
the clustering resource to use</entry>
+ </row>
+ <row>
+ <entry>brownclustertokenclass</entry>
+ <entry>no</entry>
+ <entry><emphasis>dict</emphasis> is the key of
the clustering resource to use</entry>
+ </row>
+ <row>
+ <entry>brownclusterbigram</entry>
+ <entry>no</entry>
+ <entry><emphasis>dict</emphasis> is the key of
the clustering resource to use</entry>
+ </row>
+ <row>
<entry>window</entry>
<entry>yes</entry>
<entry><emphasis>prevLength</emphasis>
and <emphasis>nextLength</emphasis> must be integers and specify the window
size</entry>
@@ -552,4 +595,4 @@ System.out.println(result.toString());]]
</itemizedlist>
</para>
</section>
-</chapter>
\ No newline at end of file
+</chapter>
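
Note on the -sequenceCodec text added in this revision: the difference between the BIO and BILOU encodings may be easier to see on a concrete token sequence. The following is a minimal, self-contained sketch and not OpenNLP's own codec implementation. The class name SequenceCodecSketch, the NameSpan helper, the tag spellings (B-, I-, L-, U-, O) and the sample tokens are all assumptions made for illustration; the outcome names OpenNLP stores in a model may differ.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical helper, not part of the OpenNLP API.
public class SequenceCodecSketch {

    /** A typed name span over token indices: start inclusive, end exclusive. */
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** BIO: first token of a name is B, remaining tokens are I, everything else is O. */
    static String[] encodeBio(int tokenCount, List<NameSpan> spans) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "O");
        for (NameSpan s : spans) {
            tags[s.start] = "B-" + s.type;
            for (int i = s.start + 1; i < s.end; i++) {
                tags[i] = "I-" + s.type;
            }
        }
        return tags;
    }

    /** BILOU: like BIO, but the last token of a name is L and a one-token name is U. */
    static String[] encodeBilou(int tokenCount, List<NameSpan> spans) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "O");
        for (NameSpan s : spans) {
            if (s.end - s.start == 1) {
                tags[s.start] = "U-" + s.type;
            } else {
                tags[s.start] = "B-" + s.type;
                for (int i = s.start + 1; i < s.end - 1; i++) {
                    tags[i] = "I-" + s.type;
                }
                tags[s.end - 1] = "L-" + s.type;
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up sample sentence: one single-token name and one two-token name.
        String[] tokens = {"Mr", ".", "Vinken", "is", "chairman", "of", "Elsevier", "N.V."};
        List<NameSpan> spans = Arrays.asList(
            new NameSpan(2, 3, "person"),
            new NameSpan(6, 8, "organization"));

        System.out.println(String.join(" ", tokens));
        System.out.println(String.join(" ", encodeBio(tokens.length, spans)));
        System.out.println(String.join(" ", encodeBilou(tokens.length, spans)));
    }
}
```

In this sketch the single-token name "Vinken" is tagged B-person under BIO but U-person under BILOU, and the final token of the two-token name is tagged I-organization under BIO but L-organization under BILOU. BILOU therefore gives the learner explicit boundary and single-token outcomes, at the cost of more outcome classes per name type.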