This is an automated email from the ASF dual-hosted git repository.
colen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/opennlp.git
The following commit(s) were added to refs/heads/master by this push:
new 3e8b8aa OPENNLP-1110: Changes visibility of LanguageDetectorContextGenerator. Also extracts an interface for it and updates documentation.
3e8b8aa is described below
commit 3e8b8aa06bb7742cd1643fad72d7739cda400010
Author: William D C M SILVA <[email protected]>
AuthorDate: Wed Jul 5 08:30:28 2017 -0300
OPENNLP-1110: Changes visibility of LanguageDetectorContextGenerator. Also extracts an interface for it and updates documentation.
---
opennlp-docs/src/docbkx/corpora.xml | 93 +---------------------
opennlp-docs/src/docbkx/langdetect.xml | 6 +-
...> DefaultLanguageDetectorContextGenerator.java} | 11 +--
.../LanguageDetectorContextGenerator.java | 51 +-----------
.../tools/langdetect/LanguageDetectorFactory.java | 16 +++-
...faultLanguageDetectorContextGeneratorTest.java} | 4 +-
.../opennlp/tools/langdetect/DummyFactory.java | 8 +-
7 files changed, 34 insertions(+), 155 deletions(-)
diff --git a/opennlp-docs/src/docbkx/corpora.xml b/opennlp-docs/src/docbkx/corpora.xml
index 26c8141..aa97dc2 100644
--- a/opennlp-docs/src/docbkx/corpora.xml
+++ b/opennlp-docs/src/docbkx/corpora.xml
@@ -437,99 +437,8 @@ F-Measure: 0.7717879983140168]]>
</para>
</section>
</section>
- <section id="tools.corpora.leipzig">
- <title>Leipzig Corpora</title>
- <para>
- The Leipzig Corpora collection presents corpora in different languages. The corpora is a collection of individual sentences collected
- from the web and newspapers. The Corpora is available as plain text and as MySQL database tables. The OpenNLP integration can only
- use the plain text version.
- </para>
- <para>
- The corpora in the different languages can be used to train a document categorizer model which can detect the document language.
- The individual plain text packages can be downloaded here:
- <ulink url="http://corpora.uni-leipzig.de/download.html">http://corpora.uni-leipzig.de/download.html</ulink>
-
- <para>
- After all packages have been downloaded, unzip them and use the following commands to
- produce a training file which can be processed by the Document Categorizer:
- <screen>
- <![CDATA[
-$ opennlp DoccatConverter leipzig -lang cat -data Leipzig/cat100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang de -data Leipzig/de100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang dk -data Leipzig/dk100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang ee -data Leipzig/ee100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang en -data Leipzig/en100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang fi -data Leipzig/fi100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang fr -data Leipzig/fr100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang it -data Leipzig/it100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang jp -data Leipzig/jp100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang kr -data Leipzig/kr100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang nl -data Leipzig/nl100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang no -data Leipzig/no100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang se -data Leipzig/se100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang sorb -data Leipzig/sorb100k/sentences.txt >> lang.train
-$ opennlp DoccatConverter leipzig -lang tr -data Leipzig/tr100k/sentences.txt >> lang.train]]>
- </screen>
- </para>
- <para>
- Depending on your platform local it might be problematic to output characters which are not supported by that encoding,
- we suggest to run these command on a platform which has a unicode default encoding, e.g. Linux with UTF-8.
- <para>
- After the lang.train file is created the actual language detection document categorizer model
- can be created with the following command.
- <screen>
- <![CDATA[
-$ opennlp DoccatTrainer -model lang.model -lang x-unspecified -data lang.train -encoding MacRoman
-
-Indexing events using cutoff of 5
-
- Computing event counts... done. 10000 events
- Indexing... done.
-Sorting and merging events... done. Reduced 10000 events to 10000.
-Done indexing.
-Incorporating indexed data for training...
-done.
- Number of Event Tokens: 10000
- Number of Outcomes: 2
- Number of Predicates: 42730
-...done.
-Computing model parameters...
-Performing 100 iterations.
- 1: .. loglikelihood=-6931.471805600547 0.5
- 2: .. loglikelihood=-2110.9654348555955 1.0
-... cut lots of iterations ...
-
- 99: .. loglikelihood=-0.449640418555347 1.0
-100: .. loglikelihood=-0.443746359746235 1.0
-Writing document categorizer model ... done (1.210s)
-Wrote document categorizer model to
-path: /Users/joern/dev/opennlp-apache/opennlp/opennlp-tools/lang.model
-]]>
- </screen>
- In the sample above the language detection model was trained to distinguish two languages, danish and english.
- </para>
-
- <para>
- After the model is created it can be used to detect the two languages:
-
- <programlisting>
- <![CDATA[
-$ bin/opennlp Doccat ../lang.
-lang.model lang.train
-karkand:opennlp-tools joern$ bin/opennlp Doccat ../lang.model
-Loading Document Categorizer model ... done (0.289s)
-The American Finance Association is pleased to announce the award of ...
-en The American Finance Association is pleased to announce the award of ...
-Danskerne skal betale for den økonomiske krise ved at blive længere på arbejdsmarkedet .
-dk Danskerne skal betale for den økonomiske krise ved at blive længere på arbejdsmarkedet .]]>
- </programlisting>
- </para>
- </section>
- <section id="tools.corpora.ontonotes">
+ <section id="tools.corpora.ontonotes">
<title>OntoNotes Release 4.0</title>
<para>
"OntoNotes Release 4.0, Linguistic Data Consortium (LDC)
catalog number
diff --git a/opennlp-docs/src/docbkx/langdetect.xml b/opennlp-docs/src/docbkx/langdetect.xml
index 9f170ce..67412a4 100644
--- a/opennlp-docs/src/docbkx/langdetect.xml
+++ b/opennlp-docs/src/docbkx/langdetect.xml
@@ -162,9 +162,9 @@ $ bin/opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params params
<para>
 The Leipzig Corpora collection presents corpora in different languages. The corpora is a collection
 of individual sentences collected from the web and newspapers. The Corpora is available as plain text
- and as MySQL database tables. The OpenNLP integration can only use the plain text version. More
- information about the corpora and how to download can be found in the
- <link linkend="tools.corpora.leipzig">Corpora section</link>.
+ and as MySQL database tables. The OpenNLP integration can only use the plain text version.
+ The individual plain text packages can be downloaded here:
+ <ulink url="http://corpora.uni-leipzig.de/download.html">http://corpora.uni-leipzig.de/download.html</ulink>
</para>
<para>
 This corpora is specially good to train Language Detector and a converter is provided. First, you need to
diff --git a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java b/opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java
similarity index 83%
copy from opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
copy to opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java
index f0941df..41f9490 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGenerator.java
@@ -28,21 +28,21 @@ import opennlp.tools.util.normalizer.CharSequenceNormalizer;
/**
* A context generator for language detector.
*/
-class LanguageDetectorContextGenerator {
+public class DefaultLanguageDetectorContextGenerator implements LanguageDetectorContextGenerator {
protected final int minLength;
protected final int maxLength;
protected final CharSequenceNormalizer normalizer;
/**
- * Creates a customizable @{@link LanguageDetectorContextGenerator} that computes ngrams from text
+ * Creates a customizable @{@link DefaultLanguageDetectorContextGenerator} that computes ngrams from text
* @param minLength min ngrams chars
* @param maxLength max ngrams chars
* @param normalizers zero or more normalizers to
* be applied in to the text before extracting ngrams
*/
-  public LanguageDetectorContextGenerator(int minLength, int maxLength,
-                                          CharSequenceNormalizer... normalizers) {
+  public DefaultLanguageDetectorContextGenerator(int minLength, int maxLength,
+                                                 CharSequenceNormalizer... normalizers) {
this.minLength = minLength;
this.maxLength = maxLength;
@@ -54,7 +54,8 @@ class LanguageDetectorContextGenerator {
* @param document document to extract context from
* @return the generated context
*/
- public String[] getContext(String document) {
+ @Override
+ public String[] getContext(CharSequence document) {
Collection<String> context = new ArrayList<>();
NGramModel model = new NGramModel();
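The hunk above only renames the class and widens the parameter to CharSequence; the core of getContext is unchanged: collect character n-grams of length minLength..maxLength from the normalized text. A self-contained approximation using only the standard library (the real class delegates to opennlp's NGramModel and a CharSequenceNormalizer chain; the class and method names below are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NGramSketch {

  // Collects the character n-grams of length minLength..maxLength, in
  // order of first occurrence and without duplicates -- roughly what
  // DefaultLanguageDetectorContextGenerator obtains via opennlp's NGramModel.
  public static List<String> getContext(CharSequence document,
                                        int minLength, int maxLength) {
    String text = document.toString();
    Set<String> ngrams = new LinkedHashSet<>();
    for (int n = minLength; n <= maxLength; n++) {
      for (int i = 0; i + n <= text.length(); i++) {
        ngrams.add(text.substring(i, i + n));
      }
    }
    return new ArrayList<>(ngrams);
  }

  public static void main(String[] args) {
    System.out.println(getContext("abc", 1, 2));  // [a, b, c, ab, bc]
  }
}
```

The duplicate-free set is what the document categorizer consumes as features; with the default range 1 to 3, even a short sentence yields hundreds of language-discriminating character patterns.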
diff --git a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
index f0941df..0d6267b 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorContextGenerator.java
@@ -17,54 +17,9 @@
package opennlp.tools.langdetect;
-import java.util.ArrayList;
-import java.util.Collection;
-
-import opennlp.tools.ngram.NGramModel;
-import opennlp.tools.util.StringList;
-import opennlp.tools.util.normalizer.AggregateCharSequenceNormalizer;
-import opennlp.tools.util.normalizer.CharSequenceNormalizer;
-
/**
- * A context generator for language detector.
+ * A context generator interface for language detector.
*/
-class LanguageDetectorContextGenerator {
-
- protected final int minLength;
- protected final int maxLength;
- protected final CharSequenceNormalizer normalizer;
-
- /**
- * Creates a customizable @{@link LanguageDetectorContextGenerator} that computes ngrams from text
- * @param minLength min ngrams chars
- * @param maxLength max ngrams chars
- * @param normalizers zero or more normalizers to
- * be applied in to the text before extracting ngrams
- */
-  public LanguageDetectorContextGenerator(int minLength, int maxLength,
-                                          CharSequenceNormalizer... normalizers) {
- this.minLength = minLength;
- this.maxLength = maxLength;
-
- this.normalizer = new AggregateCharSequenceNormalizer(normalizers);
- }
-
- /**
- * Generates the context for a document using character ngrams.
- * @param document document to extract context from
- * @return the generated context
- */
- public String[] getContext(String document) {
- Collection<String> context = new ArrayList<>();
-
- NGramModel model = new NGramModel();
- model.add(normalizer.normalize(document), minLength, maxLength);
-
- for (StringList tokenList : model) {
- if (tokenList.size() > 0) {
- context.add(tokenList.getToken(0));
- }
- }
- return context.toArray(new String[context.size()]);
- }
+public interface LanguageDetectorContextGenerator {
+ String[] getContext(CharSequence document);
}
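With the interface extracted, any strategy for turning a document into features can be plugged into the detector, not just character n-grams. A hypothetical token-based generator, sketched standalone (the interface is copied inline so the snippet compiles without opennlp on the classpath; in real code you would implement opennlp.tools.langdetect.LanguageDetectorContextGenerator, and the CharSequence parameter matches the signature added in this commit):

```java
public class TokenContextSketch {

  // Stand-in copy of the extracted opennlp interface, for illustration only.
  interface LanguageDetectorContextGenerator {
    String[] getContext(CharSequence document);
  }

  // Hypothetical alternative strategy: whole lower-cased tokens
  // instead of the default character n-grams.
  static class TokenContextGenerator implements LanguageDetectorContextGenerator {
    @Override
    public String[] getContext(CharSequence document) {
      String text = document.toString().trim().toLowerCase();
      return text.isEmpty() ? new String[0] : text.split("\\s+");
    }
  }

  public static void main(String[] args) {
    LanguageDetectorContextGenerator cg = new TokenContextGenerator();
    System.out.println(String.join(",", cg.getContext("Hello  World")));  // hello,world
  }
}
```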
diff --git a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java
index 11357ec..422a98d 100644
--- a/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java
+++ b/opennlp-tools/src/main/java/opennlp/tools/langdetect/LanguageDetectorFactory.java
@@ -27,10 +27,24 @@ import opennlp.tools.util.normalizer.TwitterCharSequenceNormalizer;
import opennlp.tools.util.normalizer.UrlCharSequenceNormalizer;
+/**
+ * <p>Default factory used by Language Detector. Extend this class to change the Language Detector
+ * behaviour, such as the {@link LanguageDetectorContextGenerator}.</p>
+ * <p>The default {@link DefaultLanguageDetectorContextGenerator} will use char n-grams of
+ * size 1 to 3 and the following normalizers:
+ * size 1 to 3 and the following normalizers:
+ * <ul>
+ * <li> {@link EmojiCharSequenceNormalizer}
+ * <li> {@link UrlCharSequenceNormalizer}
+ * <li> {@link TwitterCharSequenceNormalizer}
+ * <li> {@link NumberCharSequenceNormalizer}
+ * <li> {@link ShrinkCharSequenceNormalizer}
+ * </ul>
+ * </p>
+ */
public class LanguageDetectorFactory extends BaseToolFactory {
public LanguageDetectorContextGenerator getContextGenerator() {
- return new LanguageDetectorContextGenerator(1, 3,
+ return new DefaultLanguageDetectorContextGenerator(1, 3,
EmojiCharSequenceNormalizer.getInstance(),
UrlCharSequenceNormalizer.getInstance(),
TwitterCharSequenceNormalizer.getInstance(),
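The new javadoc describes the extension point this hunk enables: the factory method now returns the interface, so subclasses can swap in another generator. The pattern, stripped of opennlp types, is a plain factory-method override (all names below are illustrative stand-ins; in real code you would subclass LanguageDetectorFactory and override getContextGenerator()):

```java
public class FactorySketch {

  interface ContextGenerator {
    String[] getContext(CharSequence doc);
  }

  static class Factory {
    // Default strategy; subclasses override this factory method.
    ContextGenerator getContextGenerator() {
      return doc -> new String[] { "default:" + doc };
    }
  }

  static class CustomFactory extends Factory {
    @Override
    ContextGenerator getContextGenerator() {
      return doc -> new String[] { "custom:" + doc };
    }
  }

  public static void main(String[] args) {
    Factory f = new CustomFactory();
    // The caller neither knows nor cares which strategy it receives.
    System.out.println(f.getContextGenerator().getContext("abc")[0]);  // custom:abc
  }
}
```

Because the return type is the interface rather than the concrete class, the detector training and evaluation code is untouched by the swap.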
diff --git a/opennlp-tools/src/test/java/opennlp/tools/langdetect/LanguageDetectorContextGeneratorTest.java b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGeneratorTest.java
similarity index 89%
rename from opennlp-tools/src/test/java/opennlp/tools/langdetect/LanguageDetectorContextGeneratorTest.java
rename to opennlp-tools/src/test/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGeneratorTest.java
index dc6ca26..29f45a5 100644
--- a/opennlp-tools/src/test/java/opennlp/tools/langdetect/LanguageDetectorContextGeneratorTest.java
+++ b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DefaultLanguageDetectorContextGeneratorTest.java
@@ -24,13 +24,13 @@ import org.junit.Assert;
import org.junit.Test;
-public class LanguageDetectorContextGeneratorTest {
+public class DefaultLanguageDetectorContextGeneratorTest {
@Test
public void extractContext() throws Exception {
String doc = "abcde fghijk";
-    LanguageDetectorContextGenerator cg = new LanguageDetectorContextGenerator(1, 3);
+    LanguageDetectorContextGenerator cg = new DefaultLanguageDetectorContextGenerator(1, 3);
Collection<String> features = Arrays.asList(cg.getContext(doc));
diff --git a/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java
index 1aae887..7e3a9da 100644
--- a/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java
+++ b/opennlp-tools/src/test/java/opennlp/tools/langdetect/DummyFactory.java
@@ -53,22 +53,22 @@ public class DummyFactory extends LanguageDetectorFactory {
}
}
- public class MyContectGenerator extends LanguageDetectorContextGenerator {
+  public class MyContectGenerator extends DefaultLanguageDetectorContextGenerator {
    public MyContectGenerator(int min, int max, CharSequenceNormalizer... normalizers) {
super(min, max, normalizers);
}
@Override
- public String[] getContext(String document) {
+ public String[] getContext(CharSequence document) {
String[] superContext = super.getContext(document);
List<String> context = new ArrayList(Arrays.asList(superContext));
- document = this.normalizer.normalize(document).toString();
+ document = this.normalizer.normalize(document);
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
- String[] words = tokenizer.tokenize(document);
+ String[] words = tokenizer.tokenize(document.toString());
NGramModel tokenNgramModel = new NGramModel();
if (words.length > 0) {
tokenNgramModel.add(new StringList(words), 1, 3);
--
To stop receiving notification emails like this one, please contact [email protected].