Repository: opennlp
Updated Branches:
  refs/heads/902 001b97068 -> 4f2441bc1


Adds a small documentation section for Morfologik add-on

See issue OPENNLP-902


Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/4f2441bc
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/4f2441bc
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/4f2441bc

Branch: refs/heads/902
Commit: 4f2441bc1b50502b95a86bff94e8a9544322baf5
Parents: 001b970
Author: William Colen <[email protected]>
Authored: Wed Dec 28 01:43:55 2016 -0200
Committer: William Colen <[email protected]>
Committed: Wed Dec 28 01:43:55 2016 -0200

----------------------------------------------------------------------
 .../src/docbkx/morfologik-addon.out.xml         |   0
 opennlp-docs/src/docbkx/morfologik-addon.xml    | 153 +++++++++++++++++++
 opennlp-docs/src/docbkx/opennlp.xml             |   1 +
 3 files changed, 154 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/morfologik-addon.out.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.out.xml b/opennlp-docs/src/docbkx/morfologik-addon.out.xml
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/morfologik-addon.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.xml b/opennlp-docs/src/docbkx/morfologik-addon.xml
new file mode 100644
index 0000000..6f18844
--- /dev/null
+++ b/opennlp-docs/src/docbkx/morfologik-addon.xml
@@ -0,0 +1,153 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
+       license agreements. See the NOTICE file distributed with this work for additional
+       information regarding copyright ownership. The ASF licenses this file to
+       you under the Apache License, Version 2.0 (the "License"); you may not use
+       this file except in compliance with the License. You may obtain a copy of
+       the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
+       by applicable law or agreed to in writing, software distributed under the
+       License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
+       OF ANY KIND, either express or implied. See the License for the specific
+       language governing permissions and limitations under the License. -->
+
+
+<chapter id="tools.morfologik-addon">
+       <title>Morfologik Addon</title>
+               <para>
+                       <ulink url="https://github.com/morfologik/morfologik-stemming"><citetitle>Morfologik</citetitle></ulink>
+                       provides tools for finite state automata (FSA) construction and for dictionary-based morphological analysis.
+               </para>
+               <para>
+                       The Morfologik Addon implements OpenNLP interfaces and extensions that allow the use of Morfologik FSA dictionary tools.
+               </para>
+               <section id="tools.morfologik-addon.api">
+                       <title>Morfologik Integration</title>
+                       <para>
+                       For easy integration with OpenNLP, the following implementations are provided:
+                       <itemizedlist mark='opencircle'>
+                               <listitem>
+                                       <para>
+                                       The <code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code> and helps create a POSTagger model with an embedded FSA TagDictionary.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>
+                                       The <code>MorfologikTagDictionary</code> implements an FSA-based <code>TagDictionary</code>, which produces much smaller files than the default XML-based dictionary and reduces memory consumption.
+                                       </para>
+                               </listitem>
+                               <listitem>
+                                       <para>
+                                       The <code>MorfologikLemmatizer</code> implements a <code>Lemmatizer</code> backed by FSA dictionaries.
+                                       </para>
+                               </listitem>
+                       </itemizedlist>
+               </para>
+               <para>
+               The first two implementations can be used directly from the command line, as in the example below. Given an FSA Morfologik dictionary (the next section explains how to build one), you can train a POS Tagger
+               model with an embedded FSA dictionary.
+               </para>
+               <para>
+               The example trains a POSTagger with a CONLL corpus named <code>portuguese_bosque_train.conll</code> and an FSA dictionary named
+               <code>pt-morfologik.dict</code>. It outputs a model named <code>pos-pt_fsadic.model</code>.
+               
+               <screen>
+               <![CDATA[
+$ bin/opennlp POSTaggerTrainer -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll \
+        -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict]]>
+               </screen>
+               
+               </para>
+               <para>
+               The next example shows how to use the <code>MorfologikLemmatizer</code>. You will need a lemma dictionary and its info file. This example uses a very small Portuguese dictionary.
+               Its syntax is <code>lemma,lexeme,postag</code>.
+               </para>
+               <para>
+               File <code>lemmaDictionary.txt:</code>
+               <screen>
+               <![CDATA[
+casa,casa,NOUN
+casar,casa,V
+casar,casar,V-INF
+Casa,Casa,PROP
+casa,casinha,NOUN
+casa,casona,NOUN
+menino,menina,NOUN
+menino,menino,NOUN
+menino,meninão,NOUN
+menino,menininho,NOUN
+carro,carro,NOUN]]>
+               </screen>
+               </para>
+               <para>
+               A metadata file is mandatory. It must have the same name as the dictionary, but with the .info extension. File <code>lemmaDictionary.info:</code>
+               <screen>
+               <![CDATA[
+#
+# REQUIRED PROPERTIES
+#
+
+# Column (lemma, inflected, tag) separator. This must be a single byte in the target encoding.
+fsa.dict.separator=,
+
+# The charset in which the input is encoded. UTF-8 is strongly recommended.
+fsa.dict.encoding=UTF-8
+
+# The type of lemma-inflected form encoding compression that precedes automaton
+# construction. Allowed values: [suffix, infix, prefix, none].
+# Details are in Daciuk's paper and in the code. 
+# Leave at 'prefix' if not sure.
+fsa.dict.encoder=prefix
+               ]]>
+               </screen>
+               </para>
+               <para>
+               The following code creates a binary FSA Morfologik dictionary, loads it into a <code>MorfologikLemmatizer</code>, and uses it to
+               find the lemmas of the word "casa" as a noun and as a verb.
+               
+                               <programlisting language="java">
+               <![CDATA[
+// Part 1: compile a FSA lemma dictionary 
+   
+// We need the tabular dictionary. A metadata file with the same name,
+// but with the .info extension, is mandatory.
+Path textLemmaDictionary = Paths.get("lemmaDictionary.txt");
+
+// this will build a binary dictionary located in compiledLemmaDictionary
+Path compiledLemmaDictionary = new MorfologikDictionayBuilder()
+    .build(textLemmaDictionary);
+
+// Part 2: load a MorfologikLemmatizer and use it
+MorfologikLemmatizer lemmatizer = new MorfologikLemmatizer(compiledLemmaDictionary);
+
+String[] toks = {"casa", "casa"};
+String[] tags = {"NOUN", "V"};
+
+String[] lemmas = lemmatizer.lemmatize(toks, tags);
+System.out.println(Arrays.toString(lemmas)); // outputs [casa, casar]
+    ]]>
+                       </programlisting>
+               
+               </para>
+               </section>
+               <section id="tools.morfologik-addon.cmdline">
+                       <title>Morfologik CLI Tools</title>
+                       <para>
+                               The Morfologik addon provides command line tools. <code>XMLDictionaryToTable</code> makes it easy to convert an OpenNLP XML-based dictionary
+                               to a tabular format. <code>MorfologikDictionaryBuilder</code> takes a tabular dictionary and outputs a binary Morfologik FSA dictionary.
+                       </para>
+                       <screen>
+               <![CDATA[
+$ sh bin/morfologik-addon
+OpenNLP Morfologik Addon. Usage: opennlp-morfologik-addon TOOL
+where TOOL is one of:
+  MorfologikDictionaryBuilder    builds a binary POS Dictionary using Morfologik
+  XMLDictionaryToTable           reads an OpenNLP XML tag dictionary and outputs it in a tabular file
+All tools print help when invoked with help parameter
+Example: opennlp-morfologik-addon POSDictionaryBuilder help
+               ]]>
+               </screen>
+               </section>
+</chapter>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/opennlp.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 257bbb4..172d06c 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -89,5 +89,6 @@ under the License.
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./corpora.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./machine-learning.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./uima-integration.xml" />
+       <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./morfologik-addon.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./cli.xml" />
 </book>
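
Note on the lemma dictionary format documented in morfologik-addon.xml:
the lemma,lexeme,postag table maps a (word form, POS tag) pair to a lemma.
The sketch below illustrates that lookup with a plain HashMap. It does not
use Morfologik or any OpenNLP class, it does not build an FSA, and the class
and method names (LemmaTableSketch, index, lemmatize) are hypothetical,
chosen only for illustration. The rows are a subset of lemmaDictionary.txt.
Returning the token unchanged for an unknown pair is a choice made for this
sketch and is not claimed to match MorfologikLemmatizer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LemmaTableSketch {

    // Subset of the rows from lemmaDictionary.txt: lemma,lexeme,postag
    static final List<String> ROWS = Arrays.asList(
        "casa,casa,NOUN",
        "casar,casa,V",
        "casar,casar,V-INF",
        "casa,casinha,NOUN",
        "menino,menina,NOUN",
        "carro,carro,NOUN");

    // Builds an index keyed by "lexeme/postag" whose value is the lemma.
    static Map<String, String> index(List<String> rows) {
        Map<String, String> byWordAndTag = new HashMap<>();
        for (String row : rows) {
            String[] cols = row.split(","); // same separator as fsa.dict.separator
            byWordAndTag.put(cols[1] + "/" + cols[2], cols[0]);
        }
        return byWordAndTag;
    }

    // Looks up each (token, tag) pair; unknown pairs fall back to the token
    // itself (an assumption of this sketch only).
    static String[] lemmatize(Map<String, String> index, String[] toks, String[] tags) {
        String[] lemmas = new String[toks.length];
        for (int i = 0; i < toks.length; i++) {
            lemmas[i] = index.getOrDefault(toks[i] + "/" + tags[i], toks[i]);
        }
        return lemmas;
    }

    public static void main(String[] args) {
        String[] toks = {"casa", "casa"};
        String[] tags = {"NOUN", "V"};
        String[] lemmas = lemmatize(index(ROWS), toks, tags);
        System.out.println(Arrays.toString(lemmas)); // prints [casa, casar]
    }
}
```

The same word form "casa" resolves to two different lemmas depending on the
tag, which is why the table carries the postag column.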
