docbkx: lemmatizer.xml opennlp.xml

ragerri Tue, 08 Sep 2015 06:38:24 -0700

Author: ragerri
Date: Tue Sep  8 13:37:02 2015
New Revision: 1701804

URL: http://svn.apache.org/r1701804
Log:
OPENNLP-812 added lemmatizer documentation section; currently only for 
dictionary lemmatizer via API


Added:
    opennlp/trunk/opennlp-docs/src/docbkx/lemmatizer.xml
Modified:
    opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml

Added: opennlp/trunk/opennlp-docs/src/docbkx/lemmatizer.xml
URL: 
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/lemmatizer.xml?rev=1701804&view=auto
==============================================================================
--- opennlp/trunk/opennlp-docs/src/docbkx/lemmatizer.xml (added)
+++ opennlp/trunk/opennlp-docs/src/docbkx/lemmatizer.xml Tue Sep  8 13:37:02 
2015
@@ -0,0 +1,125 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd";[
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<chapter id="tools.lemmatizer">
+<title>Lemmatizer</title>
+       <section id="tools.lemmatizer.tagging">
+               <title>Lemmatization</title>
+               <para>
+                 The lemmatizer returns, for a given word form (token) and 
Part of Speech tag,
+                 the dictionary form of a word, which is usually referred to 
as its lemma. A token could
+                 ambiguously be derived from several basic forms or dictionary 
words which is why the
+                 postag of the word is required to find the lemma. For 
example, the form `show' may refer
+                 to either the verb "to show" or to the noun "show".
+                 Currently OpenNLP implements a dictionary lookup lemmatizer, 
although the implementation of
+                 rule-based and probabilistic lemmatizers are pending
+                 (<ulink 
url="https://issues.apache.org/jira/browse/OPENNLP-683";>OPENNLP-683</ulink> and
+                 <ulink 
url="https://issues.apache.org/jira/browse/OPENNLP-760";>OPENNLP-760</ulink>).
+                 Contributions are very welcome!
+               </para>
+                       <section id="tools.lemmatizer.tagging.cmdline">
+               <title>Lemmatizer Tool</title>
+               <para>
+                 TODO: a command line tool for the lemmatizer is pending:
+                 <ulink 
url="https://issues.apache.org/jira/browse/OPENNLP-814";>OPENNLP-814</ulink>
+                 Contributions welcome!
+               </para>
+      </section>
+
+               <section id="tools.lemmatizer.tagging.api">
+               <title>Lemmatizer API</title>
+               <para>
+                 The Lemmatizer can be embedded into an application via its 
API. Currently only a
+                 dictionary-based SimpleLemmatizer is available. The 
SimpleLemmatizer is constructed
+                 by passing the InputStream of a lemmatizer dictionary. Such 
dictionary consists of a
+                 text file containing, for each row, a word, its postag and 
the corresponding lemma:
+                 <screen>
+               <![CDATA[
+show    NN      show
+showcase        NN      showcase
+showcases       NNS     showcase
+showdown        NN      showdown
+showdowns       NNS     showdown
+shower  NN      shower
+showers NNS     shower
+showman NN      showman
+showmanship     NN      showmanship
+showmen NNS     showman
+showroom        NN      showroom
+showrooms       NNS     showroom
+shows   NNS     show
+showstopper     NN      showstopper
+showstoppers    NNS     showstopper
+shrapnel        NN      shrapnel
+               ]]>
+               </screen>
+                       First the dictionary must be loaded into memory from 
disk or another source.
+                       In the sample below it is loaded from disk.
+                       <programlisting language="java">
+                               <![CDATA[
+InputStream dictLemmatizer = null;
+
+try {
+  dictLemmatizer = new FileInputStream("english-lemmatizer.txt");
+}
+catch (IOException e) {
+  // dictionary loading failed, handle the error
+  e.printStackTrace();
+}
+finally {
+  if (dictLemmatizer != null) {
+    try {
+      dictLemmatizer.close();
+    }
+    catch (IOException e) {
+    }
+  }
+}]]>
+                       </programlisting>
+                       After the dictionary is loaded the SimpleLemmatizer can 
be instantiated.
+                       <programlisting language="java">
+                         <![CDATA[
+SimpleLemmatizer lemmatizer = new SimpleLemmatizer(dictLemmatizer);]]>
+                       </programlisting>
+                       The SimpleLemmatizer instance is now ready. It expects 
two String objects as input,
+                       a token and its postag.
+          </para>
+          <para>
+          The following code shows how to find a lemma for a token postag pair 
and store them in an ArrayList.
+               <programlisting language="java">
+                 <![CDATA[
+String sent[] = new String[]{"Most", "large", "cities", "in", "the", "US", 
"had",
+                             "morning", "and", "afternoon", "newspapers", "."};
+String tags[] = tagger.tag(sent);
+List<String> lemmas = new ArrayList<String>();
+for (int i = 0; i < sent.length; i++) {
+  lemmas.add(lemmatizer.lemmatize(sent[i], tags[i]));
+}
+]]>
+                       </programlisting>
+                       The tags array contains one part-of-speech tag for each 
token in the input array. The corresponding
+                       tag and lemmas can be found at the same index as the 
token has in the input array.
+                        </para>
+       </section>
+       </section>
+</chapter>

Modified: opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml
URL: 
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml?rev=1701804&r1=1701803&r2=1701804&view=diff
==============================================================================
--- opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml (original)
+++ opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml Tue Sep  8 13:37:02 2015
@@ -81,6 +81,7 @@ under the License.
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./namefinder.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./doccat.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./postagger.xml" />
+       <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./lemmatizer.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./chunker.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./parser.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"; 
href="./coref.xml" />

svn commit: r1701804 - in /opennlp/trunk/opennlp-docs/src/docbkx: lemmatizer.xml opennlp.xml

Reply via email to