opennlp git commit: OPENNLP-979 Update lemmatizer doc after API change

joern Wed, 17 May 2017 14:12:07 -0700

Repository: opennlp
Updated Branches:
  refs/heads/master 839ff1099 -> ee9fdb8aa



OPENNLP-979 Update lemmatizer doc after API change


Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/ee9fdb8a
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/ee9fdb8a
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/ee9fdb8a

Branch: refs/heads/master
Commit: ee9fdb8aad0e4c43bba85e50be3687475bf2221d
Parents: 839ff10
Author: Rodrigo Agerri <[email protected]>
Authored: Wed May 17 23:04:23 2017 +0200
Committer: Rodrigo Agerri <[email protected]>
Committed: Wed May 17 23:04:23 2017 +0200

----------------------------------------------------------------------
 opennlp-docs/src/docbkx/lemmatizer.xml | 54 ++++++++++++++++-------------
 1 file changed, 30 insertions(+), 24 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/opennlp/blob/ee9fdb8a/opennlp-docs/src/docbkx/lemmatizer.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml
index 1fa5540..630b04d 100644
--- a/opennlp-docs/src/docbkx/lemmatizer.xml
+++ b/opennlp-docs/src/docbkx/lemmatizer.xml
@@ -121,10 +121,9 @@ String[] postags = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN",
     "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS",
     "." };
 
-String[] lemmas = lemmatizer.lemmatize(tokens, postags);
-String[] decodedLemmas = lemmatizer.decodeLemmas(tokens, lemmas);]]>
+String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
                </programlisting>
-                               The decodedLemmas array contains one lemma for each token in the
+                               The lemmas array contains one lemma for each token in the
                                input array. The corresponding
                                tag and lemma can be found at the same index as the token has in the
                                input array.
@@ -133,29 +132,37 @@ String[] decodedLemmas = lemmatizer.decodeLemmas(tokens, lemmas);]]>
                        <para>
                                The DictionaryLemmatizer is constructed
                                by passing the InputStream of a lemmatizer dictionary. Such dictionary
-                               consists of a
-                               text file containing, for each row, a word, its postag and the
-                               corresponding lemma:
+                               consists of a text file containing, for each row, a word, its postag and the
+                               corresponding lemma, each column separated by a tab character.
                                <screen>
                <![CDATA[
-show    NN      show
-showcase        NN      showcase
-showcases       NNS     showcase
-showdown        NN      showdown
-showdowns       NNS     showdown
-shower  NN      shower
-showers NNS     shower
-showman NN      showman
-showmanship     NN      showmanship
-showmen NNS     showman
-showroom        NN      showroom
-showrooms       NNS     showroom
-shows   NNS     show
-showstopper     NN      showstopper
-showstoppers    NNS     showstopper
-shrapnel        NN      shrapnel
+show            NN      show
+showcase        NN      showcase
+showcases       NNS     showcase
+showdown        NN      showdown
+showdowns       NNS     showdown
+shower          NN      shower
+showers         NNS     shower
+showman         NN      showman
+showmanship     NN      showmanship
+showmen         NNS     showman
+showroom        NN      showroom
+showrooms       NNS     showroom
+shows           NNS     show
+shrapnel        NN      shrapnel
                ]]>
                </screen>
+                               Alternatively, if a (word,postag) pair can output multiple lemmas, the
+                               lemmatizer dictionary would consist of a text file containing, for
+                               each row, a word, its postag and the corresponding lemmas separated by "#":
+                               <screen>
+               <![CDATA[
+muestras        NN      muestra
+cantaba         V       cantar
+fue             V       ir#ser
+entramos        V       entrar
+               ]]>
+                                       </screen>
                                First the dictionary must be loaded into memory from disk or another
                                source.
                                In the sample below it is loaded from disk.
@@ -180,8 +187,7 @@ DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);]]>
                        </para>
                        <para>
                                The following code shows how to find a lemma using a
-                               DictionaryLemmatizer. There is no need to decode the
-                               lemmas when using the DictionaryLemmatizer.
+                               DictionaryLemmatizer.
                                <programlisting language="java">
                  <![CDATA[
 String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",

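For readers who want to see the dictionary format from the patch in concrete terms, here is a minimal sketch of how such rows could be read: one row per line, three tab-separated columns (word, postag, lemma), with multiple lemmas for the same (word,postag) pair joined by "#". The class LemmaDictionarySketch and its parse method are hypothetical names used only for illustration. This is not the OpenNLP DictionaryLemmatizer implementation, and skipping malformed rows is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of OpenNLP: parses rows in the dictionary format described in the patch. */
public class LemmaDictionarySketch {

    /**
     * Maps each (word, postag) pair to its list of lemmas.
     * Each row is expected to look like: word TAB postag TAB lemma[#lemma...]
     */
    static Map<List<String>, List<String>> parse(List<String> rows) {
        Map<List<String>, List<String>> dict = new HashMap<>();
        for (String row : rows) {
            String[] cols = row.trim().split("\t");
            if (cols.length != 3) {
                continue; // assumption of this sketch: skip blank or malformed rows
            }
            List<String> key = Arrays.asList(cols[0], cols[1]);
            // "#" separates alternative lemmas for the same (word, postag) pair
            dict.put(key, new ArrayList<>(Arrays.asList(cols[2].split("#"))));
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "muestras\tNN\tmuestra",
            "fue\tV\tir#ser");
        Map<List<String>, List<String>> dict = parse(rows);
        // Look up the lemmas for a (word, postag) pair
        System.out.println(dict.get(Arrays.asList("fue", "V")));
        System.out.println(dict.get(Arrays.asList("muestras", "NN")));
    }
}
```

Keying the map on the (word, postag) pair rather than the word alone mirrors the description in the patch, where the lemma depends on both columns.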