Repository: opennlp
Updated Branches:
  refs/heads/master 839ff1099 -> ee9fdb8aa
OPENNLP-979 Update lemmatizer doc after API change

Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/ee9fdb8a
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/ee9fdb8a
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/ee9fdb8a

Branch: refs/heads/master
Commit: ee9fdb8aad0e4c43bba85e50be3687475bf2221d
Parents: 839ff10
Author: Rodrigo Agerri <[email protected]>
Authored: Wed May 17 23:04:23 2017 +0200
Committer: Rodrigo Agerri <[email protected]>
Committed: Wed May 17 23:04:23 2017 +0200

----------------------------------------------------------------------
 opennlp-docs/src/docbkx/lemmatizer.xml | 54 ++++++++++++++++-------------
 1 file changed, 30 insertions(+), 24 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/opennlp/blob/ee9fdb8a/opennlp-docs/src/docbkx/lemmatizer.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml
index 1fa5540..630b04d 100644
--- a/opennlp-docs/src/docbkx/lemmatizer.xml
+++ b/opennlp-docs/src/docbkx/lemmatizer.xml
@@ -121,10 +121,9 @@ String[] postags = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN",
 "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS",
 "." };
 
-String[] lemmas = lemmatizer.lemmatize(tokens, postags);
-String[] decodedLemmas = lemmatizer.decodeLemmas(tokens, lemmas);]]>
+String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
 </programlisting>
- The decodedLemmas array contains one lemma for each token in the
+ The lemmas array contains one lemma for each token in the
 input array. The corresponding
 tag and lemma can be found at the same index as the token has in the
 input array.
@@ -133,29 +132,37 @@ String[] decodedLemmas = lemmatizer.decodeLemmas(tokens, lemmas);]]>
 <para>
 The DictionaryLemmatizer is constructed
 by passing the InputStream of a lemmatizer dictionary. Such dictionary
- consists of a
- text file containing, for each row, a word, its postag and the
- corresponding lemma:
+ consists of a text file containing, for each row, a word, its postag and the
+ corresponding lemma, each column separated by a tab character.
 <screen>
 <![CDATA[
-show NN show
-showcase NN showcase
-showcases NNS showcase
-showdown NN showdown
-showdowns NNS showdown
-shower NN shower
-showers NNS shower
-showman NN showman
-showmanship NN showmanship
-showmen NNS showman
-showroom NN showroom
-showrooms NNS showroom
-shows NNS show
-showstopper NN showstopper
-showstoppers NNS showstopper
-shrapnel NN shrapnel
+show NN show
+showcase NN showcase
+showcases NNS showcase
+showdown NN showdown
+showdowns NNS showdown
+shower NN shower
+showers NNS shower
+showman NN showman
+showmanship NN showmanship
+showmen NNS showman
+showroom NN showroom
+showrooms NNS showroom
+shows NNS show
+shrapnel NN shrapnel
 ]]>
 </screen>
+ Alternatively, if a (word,postag) pair can output multiple lemmas, the
+ the lemmatizer dictionary would consists of a text file containing, for
+ each row, a word, its postag and the corresponding lemmas separated by "#":
+ <screen>
+ <![CDATA[
+muestras NN muestra
+cantaba V cantar
+fue V ir#ser
+entramos V entrar
+ ]]>
+ </screen>
 First the dictionary must be loaded into memory from disk or another
 source.
 In the sample below it is loaded from disk.
@@ -180,8 +187,7 @@ DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);]]>
 </para>
 <para>
 The following code shows how to find a lemma using a
- DictionaryLemmatizer. There is no need to decode the
- lemmas when using the DictionaryLemmatizer.
+ DictionaryLemmatizer.
 <programlisting language="java">
 <![CDATA[
 String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
