This is an automated email from the ASF dual-hosted git repository.
colen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/opennlp.git
The following commit(s) were added to refs/heads/master by this push:
new 3e8b8aa OPENNLP-1110: Changes visibility of LanguageDetectorContextGenerator. Also extracts an interface for it and updates documentation.
3e8b8aa is described below
commit 3e8b8aa06bb7742cd1643fad72d7739cda400010
Author: William D C M SILVA <[email protected]>
AuthorDate: Wed Jul 5 08:30:28 2017 -0300
OPENNLP-1110: Changes visibility of LanguageDetectorContextGenerator. Also extracts an interface for it and updates documentation.
---
opennlp-docs/src/docbkx/corpora.xml | 93 +---------------------
opennlp-docs/src/docbkx/langdetect.xml | 6 +-
...> DefaultLanguageDetectorContextGenerator.java} | 11 +--
.../LanguageDetectorContextGenerator.java | 51 +-----------
.../tools/langdetect/LanguageDetectorFactory.java | 16 +++-
...faultLanguageDetectorContextGeneratorTest.java} | 4 +-
.../opennlp/tools/langdetect/DummyFactory.java | 8 +-
7 files changed, 34 insertions(+), 155 deletions(-)
diff --git a/opennlp-docs/src/docbkx/corpora.xml b/opennlp-docs/src/docbkx/corpora.xml
index 26c8141..aa97dc2 100644
--- a/opennlp-docs/src/docbkx/corpora.xml
+++ b/opennlp-docs/src/docbkx/corpora.xml
@@ -437,99 +437,8 @@ F-Measure: 0.7717879983140168]]>
</para>
</section>
</section>
- <section id="tools.corpora.leipzig">
- <title>Leipzig Corpora</title>
- <para>
- The Leipzig Corpora collection presents corpora in different languages. The corpora is a collection of individual sentences collected
- from the web and newspapers. The Corpora is available as plain text and as MySQL database tables. The OpenNLP integration can only
- use the plain text version.
- </para>
- <para>
- The corpora in the different languages can be used to train a document categorizer model which can detect the document language.
- The individual plain text packages can be downloaded here:
- <ulink url="http://corpora.uni-leipzig.de/download.html">http://corpora.uni-leipzig.de/download.html</ulink>
-
- <para>
- After all packages have been downloaded, unzip them and use the following commands to
- produce a training file which can be processed by the Document Categorizer:
- <screen>
- <![CDATA[
-$ opennlp DoccatConverter leipzig -lang cat -data Leipzig/cat100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang de -data Leipzig/de100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang dk -data Leipzig/dk100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang ee -data Leipzig/ee100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang en -data Leipzig/en100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang fi -data Leipzig/fi100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang fr -data Leipzig/fr100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang it -data Leipzig/it100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang jp -data Leipzig/jp100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang kr -data Leipzig/kr100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang nl -data Leipzig/nl100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang no -data Leipzig/no100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang se -data Leipzig/se100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang sorb -data Leipzig/sorb100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang tr -data Leipzig/tr100k/sentences.txt >> lang.train]]>
- </screen>
- </para>
- <para>
- Depending on your platform local it might be problematic to output characters which are not supported by that encoding,
- we suggest to run these command on a platform which has a unicode default encoding, e.g. Linux with UTF-8.
- <para>
- After the lang.train file is created the actual language detection document categorizer model
- can be created with the following command.
- <screen>
- <![CDATA[
-$ opennlp DoccatTrainer -model lang.model -lang x-unspecified -data lang.train -encoding MacRoman
-
-Indexing events using cutoff of 5
-
- Computing event counts... done. 10000 events
- Indexing... done.
-Sorting and merging events... done. Reduced 10000 events to 10000.
-Done indexing.
-Incorporating indexed data for training...
-done.
- Number of Event Tokens: 10000
- Number of Outcomes: 2
- Number of Predicates: 42730
-...done.
-Computing model parameters...
-Performing 100 iterations.
- 1: .. loglikelihood=-6931.471805600547 0.5
- 2: .. loglikelihood=-2110.9654348555955 1.0
-... cut lots of iterations ...
-
- 99: .. loglikelihood=-0.449640418555347 1.0
-100: .. loglikelihood=-0.443746359746235 1.0
-Writing document categorizer model ... done (1.210s)
-Wrote document categorizer model to
-path: /Users/joern/dev/opennlp-apache/opennlp/opennlp-tools/lang.model
-]]>
- </screen>
- In the sample above the language detection model was trained to distinguish two languages, danish and english.
- </para>
-
- <para>
- After the model is created it can be used to detect the two languages:
-
- <programlisting>
- <![CDATA[
-$ bin/opennlp Doccat ../lang.
-lang.model lang.train
-karkand:opennlp-tools joern$ bin/opennlp Doccat ../lang.model
-Loading Document Categorizer model ... done (0.289s)
-The American Finance Association is pleased to announce the award of ...
-en The American Finance Association is pleased to announce the award of ...
-Danskerne skal betale for den økonomiske krise ved at blive længere på arbejdsmarkedet .
-dk Danskerne skal betale for den økonomiske krise ved at blive længere på arbejdsmarkedet .]]>
- </programlisting>
- </para>
- </section>
- <section id="tools.corpora.ontonotes">
+ <section id="tools.corpora.ontonotes">
<title>OntoNotes Release 4.0</title>
<para>
"OntoNotes Release 4.0, Linguistic Data Consortium (LDC)
catalog number
diff --git a/opennlp-docs/src/docbkx/langdetect.xml b/opennlp-docs/src/docbkx/langdetect.xml
index 9f170ce..67412a4 100644
--- a/opennlp-docs/src/docbkx/langdetect.xml
+++ b/opennlp-docs/src/docbkx/langdetect.xml
@@ -162,9 +162,9 @@ $ bin/opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params params
<para>
 The Leipzig Corpora collection presents corpora in different languages. The corpora is a collection
 of individual sentences collected from the web and newspapers. The Corpora is available as plain text
- and as MySQL database tables. The OpenNLP integration can only use the plain text version. More
- information about the corpora and how to download can be found in the
- <link linkend="tools.corpora.leipzig">Corpora section</link>.
+ and as MySQL database tables. The OpenNLP integration can only use the plain text version.
+ The individual plain text packages can be downloaded here:
+ <ulink url="http://corpora.uni-leipzig.de/download.html">http://corpora.uni-leipzig.de/download.html</ulink>
</para>
<para>
 This corpora is specially good to train Language Detector and a converter is provided. First, you need to
diff --git a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java b/opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java
similarity index 83%
copy from opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
copy to opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java
index f0941df..41f9490 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java
@@ -28,21 +28,21 @@ import opennlp.tools.util.normalizer.CharSequenceNormalizer;
/**
* A context generator for language detector.
*/
-class LanguageDetectorContextGenerator {
+public class DefaultLanguageDetectorContextGenerator implements LanguageDetectorContextGenerator {
protected final int minLength;
protected final int maxLength;
protected final CharSequenceNormalizer normalizer;
/**
- * Creates a customizable @{@link LanguageDetectorContextGenerator} that computes ngrams from text
+ * Creates a customizable @{@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text
* @param minLength min ngrams chars
* @param maxLength max ngrams chars
* @param normalizers zero or more normalizers to
* be applied in to the text before extracting ngrams
*/
-  public LanguageDetectorContextGenerator(int minLength, int maxLength,
-                                          CharSequenceNormalizer... normalizers) {
+  public DefaultLanguageDetectorContextGenerator(int minLength, int maxLength,
+                                                 CharSequenceNormalizer... normalizers) {
this.minLength = minLength;
this.maxLength = maxLength;
@@ -54,7 +54,8 @@ class LanguageDetectorContextGenerator {
* @param document document to extract context from
* @return the generated context
*/
- public String[] getContext(String document) {
+ @Override
+ public String[] getContext(CharSequence document) {
Collection<String> context = new ArrayList<>();
NGramModel model = new NGramModel();
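The hunk above only renames the class and widens the parameter to CharSequence; the core of getContext is unchanged: collect character n-grams of length minLength..maxLength from the normalized text. A self-contained approximation using only the standard library (the real class delegates to opennlp's NGramModel and a CharSequenceNormalizer chain; the class and method names below are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NGramSketch {

  // Collects the character n-grams of length minLength..maxLength, in
  // order of first occurrence and without duplicates -- roughly what
  // DefaultLanguageDetectorContextGenerator obtains via opennlp's NGramModel.
  public static List<String> getContext(CharSequence document,
                                        int minLength, int maxLength) {
    String text = document.toString();
    Set<String> ngrams = new LinkedHashSet<>();
    for (int n = minLength; n <= maxLength; n++) {
      for (int i = 0; i + n <= text.length(); i++) {
        ngrams.add(text.substring(i, i + n));
      }
    }
    return new ArrayList<>(ngrams);
  }

  public static void main(String[] args) {
    System.out.println(getContext("abc", 1, 2));  // [a, b, c, ab, bc]
  }
}
```

The duplicate-free set is what the document categorizer consumes as features; with the default range 1 to 3, even a short sentence yields hundreds of language-discriminating character patterns.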
diff --git a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
index f0941df..0d6267b 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
@@ -17,54 +17,9 @@
package opennlp.tools.langdetect;
-import java.util.ArrayList;
-import java.util.Collection;
-
-import opennlp.tools.ngram.NGramModel;
-import opennlp.tools.util.StringList;
-import opennlp.tools.util.normalizer.AggregateCharSequenceNormalizer;
-import opennlp.tools.util.normalizer.CharSequenceNormalizer;
-
/**
- * A context generator for language detector.
+ * A context generator interface for language detector.
*/
-class LanguageDetectorContextGenerator {
-
- protected final int minLength;
- protected final int maxLength;
- protected final CharSequenceNormalizer normalizer;
-
- /**
- * Creates a customizable @{@link LanguageDetectorContextGenerator} that computes ngrams from text
- * @param minLength min ngrams chars
- * @param maxLength max ngrams chars
- * @param normalizers zero or more normalizers to
- * be applied in to the text before extracting ngrams
- */
-  public LanguageDetectorContextGenerator(int minLength, int maxLength,
-                                          CharSequenceNormalizer... normalizers) {
- this.minLength = minLength;
- this.maxLength = maxLength;
-
- this.normalizer = new AggregateCharSequenceNormalizer(normalizers);
- }
-
- /**
- * Generates the context for a document using character ngrams.
- * @param document document to extract context from
- * @return the generated context
- */
- public String[] getContext(String document) {
- Collection<String> context = new ArrayList<>();
-
- NGramModel model = new NGramModel();
- model.add(normalizer.normalize(document), minLength, maxLength);
-
- for (StringList tokenList : model) {
- if (tokenList.size() > 0) {
- context.add(tokenList.getToken(0));
- }
- }
- return context.toArray(new String[context.size()]);
- }
+public interface LanguageDetectorContextGenerator {
+ String[] getContext(CharSequence document);
}
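With the interface extracted, any strategy for turning a document into features can be plugged into the detector, not just character n-grams. A hypothetical token-based generator, sketched standalone (the interface is copied inline so the snippet compiles without opennlp on the classpath; in real code you would implement opennlp.tools.langdetect.LanguageDetectorContextGenerator, and the CharSequence parameter matches the signature added in this commit):

```java
public class TokenContextSketch {

  // Stand-in copy of the extracted opennlp interface, for illustration only.
  interface LanguageDetectorContextGenerator {
    String[] getContext(CharSequence document);
  }

  // Hypothetical alternative strategy: whole lower-cased tokens
  // instead of the default character n-grams.
  static class TokenContextGenerator implements LanguageDetectorContextGenerator {
    @Override
    public String[] getContext(CharSequence document) {
      String text = document.toString().trim().toLowerCase();
      return text.isEmpty() ? new String[0] : text.split("\\s+");
    }
  }

  public static void main(String[] args) {
    LanguageDetectorContextGenerator cg = new TokenContextGenerator();
    System.out.println(String.join(",", cg.getContext("Hello  World")));  // hello,world
  }
}
```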
diff --git a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java
index 11357ec..422a98d 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java
@@ -27,10 +27,24 @@ import opennlp.tools.util.normalizer.TwitterCharSequenceNormalizer;
import opennlp.tools.util.normalizer.UrlCharSequenceNormalizer;
+/**
+ * <p>Default factory used by Language Detector. Extend this class to change the Language Detector
+ * behaviour, such as the {@link LanguageDetectorContextGenerator}.</p>
+ * <p>The default {@link DefaultLanguageDetectorContextGenerator} will use char n-grams of
+ * size 1 to 3 and the following normalizers:
+ * size 1 to 3 and the following normalizers:
+ * <ul>
+ * <li> {@link EmojiCharSequenceNormalizer}
+ * <li> {@link UrlCharSequenceNormalizer}
+ * <li> {@link TwitterCharSequenceNormalizer}
+ * <li> {@link NumberCharSequenceNormalizer}
+ * <li> {@link ShrinkCharSequenceNormalizer}
+ * </ul>
+ * </p>
+ */
public class LanguageDetectorFactory extends BaseToolFactory {
public LanguageDetectorContextGenerator getContextGenerator() {
- return new LanguageDetectorContextGenerator(1, 3,
+ return new DefaultLanguageDetectorContextGenerator(1, 3,
EmojiCharSequenceNormalizer.getInstance(),
UrlCharSequenceNormalizer.getInstance(),
TwitterCharSequenceNormalizer.getInstance(),
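The new javadoc describes the extension point this hunk enables: the factory method now returns the interface, so subclasses can swap in another generator. The pattern, stripped of opennlp types, is a plain factory-method override (all names below are illustrative stand-ins; in real code you would subclass LanguageDetectorFactory and override getContextGenerator()):

```java
public class FactorySketch {

  interface ContextGenerator {
    String[] getContext(CharSequence doc);
  }

  static class Factory {
    // Default strategy; subclasses override this factory method.
    ContextGenerator getContextGenerator() {
      return doc -> new String[] { "default:" + doc };
    }
  }

  static class CustomFactory extends Factory {
    @Override
    ContextGenerator getContextGenerator() {
      return doc -> new String[] { "custom:" + doc };
    }
  }

  public static void main(String[] args) {
    Factory f = new CustomFactory();
    // The caller neither knows nor cares which strategy it receives.
    System.out.println(f.getContextGenerator().getContext("abc")[0]);  // custom:abc
  }
}
```

Because the return type is the interface rather than the concrete class, the detector training and evaluation code is untouched by the swap.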
diff --git a/opennlp-tools/src/test/java/opennlp/tools/langdetect/LanguageDetectorContextGeneratorTest.java b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGeneratorTest.java
similarity index 89%
rename from opennlp-tools/src/test/java/opennlp/tools/langdetect/LanguageDetectorContextGeneratorTest.java
rename to opennlp-tools/src/test/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGeneratorTest.java
index dc6ca26..29f45a5 100644
--- a/opennlp-tools/src/test/java/opennlp/tools/langdetect/LanguageDetectorContextGeneratorTest.java
+++ b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGeneratorTest.java
@@ -24,13 +24,13 @@ import org.junit.Assert;
import org.junit.Test;
-public class LanguageDetectorContextGeneratorTest {
+public class DefaultLanguageDetectorContextGeneratorTest {
@Test
public void extractContext() throws Exception {
String doc = "abcde fghijk";
-    LanguageDetectorContextGenerator cg = new LanguageDetectorContextGenerator(1, 3);
+    LanguageDetectorContextGenerator cg = new DefaultLanguageDetectorContextGenerator(1, 3);
Collection<String> features = Arrays.asList(cg.getContext(doc));
diff --git a/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java
index 1aae887..7e3a9da 100644
--- a/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java
+++ b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java
@@ -53,22 +53,22 @@ public class DummyFactory extends LanguageDetectorFactory {
}
}
- public class MyContectGenerator extends LanguageDetectorContextGenerator {
+  public class MyContectGenerator extends DefaultLanguageDetectorContextGenerator {
    public MyContectGenerator(int min, int max, CharSequenceNormalizer... normalizers) {
super(min, max, normalizers);
}
@Override
- public String[] getContext(String document) {
+ public String[] getContext(CharSequence document) {
String[] superContext = super.getContext(document);
List<String> context = new ArrayList(Arrays.asList(superContext));
- document = this.normalizer.normalize(document).toString();
+ document = this.normalizer.normalize(document);
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
- String[] words = tokenizer.tokenize(document);
+ String[] words = tokenizer.tokenize(document.toString());
NGramModel tokenNgramModel = new NGramModel();
if (words.length > 0) {
tokenNgramModel.add(new StringList(words), 1, 3);
--
To stop receiving notification emails like this one, please contact [email protected].