This is an automated email from the ASF dual-hosted git repository.
mawiesne pushed a commit to branch
experimental/cleanup-dependency-mess-of-opennlp-similarity
in repository https://gitbox.apache.org/repos/asf/opennlp-sandbox.git
The following commit(s) were added to
refs/heads/experimental/cleanup-dependency-mess-of-opennlp-similarity by this
push:
new 7c5426e reorganizes dependencies of 'opennlp-similarity' component:
            switches 'tika-app' to the more lightweight 'tika-core' dep;
            switches 'docx4j' to the more lightweight / modern 'docx4j-core' dep (11.5.1, jakarta);
            switches to ud-models in the opennlp-similarity component;
            uses thread-safe Tokenizer, POSTagger and SentenceDetector impl classes to avoid race conditions occasionally exposed by JUnit tests;
            adapts README.md
new 9bd516b Merge remote-tracking branch
'origin/experimental/cleanup-dependency-mess-of-opennlp-similarity' into
experimental/cleanup-dependency-mess-of-opennlp-similarity
7c5426e is described below
commit 7c5426ea886f07de468b4a99c32a1b2473db17ec
Author: Martin Wiesner <[email protected]>
AuthorDate: Tue Dec 10 11:26:20 2024 +0100
reorganizes dependencies of 'opennlp-similarity' component
switches 'tika-app' to the more lightweight 'tika-core' dep
switches 'docx4j' to the more lightweight / modern 'docx4j-core' dep (11.5.1, jakarta)
switches to ud-models in the opennlp-similarity component
uses thread-safe Tokenizer, POSTagger and SentenceDetector impl classes to
avoid race conditions occasionally exposed by JUnit tests
adapts README.md
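For reference, the thread-safe setup this commit adopts (visible in the
ParserChunker2MatcherProcessor hunk below) can be sketched as follows. The
DownloadUtil calls and the ThreadSafe* classes are taken from the diff itself;
the wrapping class and the sample text are illustrative only, not part of the
commit:

```
import java.io.IOException;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTagger;
import opennlp.tools.postag.ThreadSafePOSTaggerME;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.ThreadSafeSentenceDetectorME;
import opennlp.tools.tokenize.ThreadSafeTokenizerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.DownloadUtil;

public class ThreadSafePipeline {

  public static void main(String[] args) throws IOException {
    // UD models are fetched (and cached) on first use instead of being
    // shipped in src/test/resources/models, as in the removed en-sent.bin.
    SentenceModel sentModel = DownloadUtil.downloadModel(
        "en", DownloadUtil.ModelType.SENTENCE_DETECTOR, SentenceModel.class);
    TokenizerModel tokModel = DownloadUtil.downloadModel(
        "en", DownloadUtil.ModelType.TOKENIZER, TokenizerModel.class);
    POSModel posModel = DownloadUtil.downloadModel(
        "en", DownloadUtil.ModelType.POS, POSModel.class);

    // The ThreadSafe* wrappers can be shared across threads, avoiding the
    // race conditions occasionally exposed by the JUnit tests.
    SentenceDetector sentenceDetector = new ThreadSafeSentenceDetectorME(sentModel);
    Tokenizer tokenizer = new ThreadSafeTokenizerME(tokModel);
    POSTagger tagger = new ThreadSafePOSTaggerME(posModel);

    for (String sentence : sentenceDetector.sentDetect("Models are downloaded once. They are then cached.")) {
      String[] tokens = tokenizer.tokenize(sentence);
      System.out.println(String.join(" ", tagger.tag(tokens)));
    }
  }
}
```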
---
opennlp-similarity/README.md | 134 +++++++-------
opennlp-similarity/pom.xml | 84 +++++----
.../review_builder/FBOpenGraphSearchManager.java | 148 ---------------
.../review_builder/WebPageReviewExtractor.java | 2 -
.../tools/apps/utils/email/EmailSender.java | 26 +--
.../tools/apps/utils/email/SMTPAuthenticator.java | 4 +-
.../tools/doc_classifier/DocClassifier.java | 12 +-
...cClassifierTrainingSetMultilingualExtender.java | 6 +-
.../DocClassifierTrainingSetVerifier.java | 4 +-
.../enron_email_recognizer/EmailNormalizer.java | 13 +-
.../EmailTrainingSetFormer.java | 9 +-
.../main/java/opennlp/tools/nl2code/NL2Obj.java | 13 +-
.../apps/MostFrequentWordsFromPageGetter.java | 31 ++--
.../similarity/apps/ContentGeneratorRunner.java | 21 +--
.../tools/similarity/apps/solr/CommentsRel.java | 2 +-
.../apps/solr/ContentGeneratorRequestHandler.java | 51 +-----
.../solr/SearchResultsReRankerRequestHandler.java | 26 +--
.../apps/solr/WordDocBuilderEndNotes.java | 45 ++---
.../ParserChunker2MatcherProcessor.java | 201 ++++++++-------------
.../ParserPure2MatcherProcessor.java | 60 +++---
.../src/test/resources/models/en-sent.bin | Bin 98533 -> 0 bytes
pom.xml | 18 +-
22 files changed, 329 insertions(+), 581 deletions(-)
diff --git a/opennlp-similarity/README.md b/opennlp-similarity/README.md
index 7153beb..d296e46 100644
--- a/opennlp-similarity/README.md
+++ b/opennlp-similarity/README.md
@@ -6,51 +6,49 @@ It is leveraged in search, content generation & enrichment,
chatbots and other t
## What is OpenNLP.Similarity?
OpenNLP.Similarity is an NLP engine which solves a number of text processing
and search tasks based on OpenNLP and Stanford NLP parsers. It is designed to
be used by a non-linguist software engineer to build linguistically-enabled:
-<ul>
-<li>search engines</li>
-<li>recommendation systems</li>
-<li>dialogue systems</li>
-<li>text analysis and semantic processing engines</li>
-<li>data-loss prevention system</li>
-<li>content & document generation tools</li>
-<li>text writing style, authenticity, sentiment, sensitivity to sharing
recognizers</li>
-<li>general-purpose deterministic inductive learner equipped with abductive,
deductive and analogical reasoning which also embraces concept learning and
tree kernel learning. </li>
-</ul>
+
+- search engines
+- recommendation systems
+- dialogue systems
+- text analysis and semantic processing engines
+- data-loss prevention system
+- content & document generation tools
+- text writing style, authenticity, sentiment, sensitivity to sharing
recognizers
+- general-purpose deterministic inductive learner equipped with abductive,
deductive and analogical reasoning which also embraces concept learning and
tree kernel learning.
OpenNLP.Similarity provides a series of techniques to support the overall
content pipeline, from text collection to cleaning, classification,
personalization and distribution. The technology and implementation of the
content pipeline developed at eBay are described
[here](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/examples/ContentPipeline.pdf).
## Installation
- 0) Do [`git
clone`](https://github.com/bgalitsky/relevance-based-on-parse-trees.git) to set
up the environment including resources. Besides what you get from git,
`/resources` directory requires some additional work:
-
- 1) Download the main
[jar](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/opennlp-similarity.11.jar).
-
- 2) Set all necessary jars in /lib folder. Larger size jars are not on git so
please download them from [Stanford NLP site](http://nlp.stanford.edu/)
- <li>edu.mit.jverbnet-1.2.0.jar</li>
- <li>ejml-0.23.jar</li>
- <li>joda-time.jar</li>
- <li>jollyday.jar</li>
- <li>stanford-corenlp-3.5.2-models.jar</li>
- <li>xom.jar</li>
+0. Do [`git
clone`](https://github.com/bgalitsky/relevance-based-on-parse-trees.git) to set
up the environment, including resources. Besides what you get from git, the
`/resources` directory requires some additional work:
+
+1. Download the main
[jar](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/opennlp-similarity.11.jar).
+
+2. Put all necessary jars into the /lib folder. Larger jars are not on git, so
please download them from the [Stanford NLP site](http://nlp.stanford.edu/):
+ - edu.mit.jverbnet-1.2.0.jar
+ - ejml-0.23.jar
+ - joda-time.jar
+ - jollyday.jar
+ - stanford-corenlp-3.5.2-models.jar
+ - xom.jar
The rest of the jars are available via Maven.
-
- 3) Set up src/test/resources directory
- - new_vn.zip needs to be unzipped
- - OpenNLP models need to be downloaded into the directory 'models' from
[here](http://opennlp.sourceforge.net/models-1.5/)
+
+3. Set up the src/test/resources directory
+ - new_vn.zip needs to be unzipped
As a result, the following folders should be in /resources:
As obtained [from
git](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/test/resources):
- <li>/new_vn (VerbNet)</li>
- <li>/maps (some lookup files such as products, brands, first names etc.)</li>
- <li>/external_rst (examples of import of rhetoric parses from other
systems)</li>
- <li>/fca (Formal Concept Analysis learning)</li>
- <li>/taxonomies (for search support, taxonomies are auto-mined from the
web)</li>
- <li>/tree_kernel (for tree kernel learning, representation of parse trees,
thickets and trained models)</li>
+ - /new_vn (VerbNet)
+ - /maps (some lookup files such as products, brands, first names etc.)
+ - /external_rst (examples of import of rhetoric parses from other systems)
+ - /fca (Formal Concept Analysis learning)
+ - /taxonomies (for search support, taxonomies are auto-mined from the web)
+ - /tree_kernel (for tree kernel learning, representation of parse trees,
thickets and trained models)
Manual downloading is also required for:
- <li>/new_vn</li>
- <li>/w2v (where word2vector model needs to be downloaded, if desired)</li>
-
- 4) Try running tests which will give you a hint on how to integrate
OpenNLP.Similarity functionality into your application. You can start with
[Matcher
test](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/949bac8c2a41c21a1e54fec075f2966d693114a4/src/test/java/opennlp/tools/parse_thicket/matching/PTMatcherTest.java)
and observe how long paragraphs can be linguistically matched (you can compare
this with just an intersection of keywords)
+ - /new_vn
+ - /w2v (where word2vector model needs to be downloaded, if desired)
- 5) Look at [example
POMs](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/examples)
for how to better integrate into your existing project
+4. Try running the tests, which will give you a hint on how to integrate
OpenNLP.Similarity functionality into your application. You can start with the
[Matcher
test](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/949bac8c2a41c21a1e54fec075f2966d693114a4/src/test/java/opennlp/tools/parse_thicket/matching/PTMatcherTest.java)
and observe how long paragraphs can be linguistically matched (compare this
with just an intersection of keywords).
+
+5. Look at [example
POMs](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/examples)
for how to better integrate it into your existing project.
## Creating a simple project
@@ -72,55 +70,54 @@ To avoid reparsing the same strings and improve the speed,
use
It operates on the level of sentences (giving [maximal common
subtree](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/examples/Inferring_sem_prop_of_sentences.pdf))
and paragraphs (giving maximal common [sub-parse
thicket](https://en.wikipedia.org/wiki/Parse_Thicket)). Maximal common
sub-parse thicket is also represented as a [list of common
phrases](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/examples/MachineLearningSyntParseTreesGali
[...]
-<li>Search results re-ranker based on linguistic similarity</li>
-<li>Request Handler for SOLR which used parse tree similarity</li>
+- Search results re-ranker based on linguistic similarity
+- Request Handler for SOLR which uses parse tree similarity
### Search engine
The following set of functionalities is available to enable search with
linguistic features. It is desirable when a query is long (more than 4
keywords), logically complex, or ambiguous:
-<li>Search results re-ranker based on linguistic similarity</li>
-<li>Request Handler for SOLR which used parse tree similarity</li>
-<li>Taxonomy builder via learning from the web</li>
-<li>Appropriate rhetoric map of an answer verifier. If parts of the answer are
located in distinct discourse units, this answer might be irrelevant even if
all keywords are mapped</li>
-<li>Tree kernel learning re-ranker to improve search relevance within a given
domain with pre-trained model</li>
+- Search results re-ranker based on linguistic similarity
+- Request Handler for SOLR which uses parse tree similarity
+- Taxonomy builder via learning from the web
+- Verifier of the appropriate rhetoric map of an answer. If parts of the
answer are located in distinct discourse units, this answer might be irrelevant
even if all keywords are mapped
+- Tree kernel learning re-ranker to improve search relevance within a given
domain with pre-trained model
SOLR request handlers are available
[here](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/similarity/apps/solr).
The taxonomy builder is
[here](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/similarity/apps/taxo_builder).
- Examples of pre-built taxonomy are available in [this
directory](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/test/resources/taxonomies).
Please pay attention at taxonomies built for languages other than English. A
[music
taxonomy](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/test/resources/taxonomies/musicTaxonomyRoot.csv)
is an example of the seed data for taxonomy building, and [this taxonomy
hashmap dump](https://github.c [...]
+Examples of pre-built taxonomies are available in [this
directory](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/test/resources/taxonomies).
Please pay attention to taxonomies built for languages other than English. A
[music
taxonomy](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/test/resources/taxonomies/musicTaxonomyRoot.csv)
is an example of the seed data for taxonomy building, and [this taxonomy
hashmap dump](https://github.co [...]
#### Search results re-ranker
-Re-ranking scores similarity between a given `orderedListOfAnswers` and
`question`
-
- `List<Pair<String,Double>> pairList = new ArrayList<Pair<String,Double>>();`
-
- `for (String ans: orderedListOfAnswers) {`
+Re-ranking scores the similarity between a given `orderedListOfAnswers` and a
`question`:
+
+```
+ List<Pair<String,Double>> pairList = new ArrayList<Pair<String,Double>>();
- `List<List<ParseTreeChunk>> similarityResult =
m.assessRelevanceCache(question, ans);`
-
- `double score =
parseTreeChunkListScorer.getParseTreeChunkListScoreAggregPhraseType(similarityResult);`
-
- `Pair<String,Double> p = new Pair<String, Double>(ans, score);`
-
- `pairList.add(p);`
-
- `}`
+ for (String ans: orderedListOfAnswers) {
+
+ List<List<ParseTreeChunk>> similarityResult =
m.assessRelevanceCache(question, ans);
+ double score =
parseTreeChunkListScorer.getParseTreeChunkListScoreAggregPhraseType(similarityResult);
+ Pair<String,Double> p = new Pair<String, Double>(ans, score);
+ pairList.add(p);
+ }
- `Collections.sort(pairList, Comparator.comparing(p -> p.getSecond()));`
+ Collections.sort(pairList, Comparator.comparing(p -> p.getSecond()));
+```
`pairList` is then ranked according to the linguistic relevance score. This
score can be combined with other sources such as popularity, geo-proximity
and others.
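A self-contained sketch of the same loop is given below; `Matcher` (the
parse-thicket matcher behind `assessRelevanceCache`, exercised by the
PTMatcherTest linked above) and `Pair` are assumed to live in the packages
shown, and the wrapping class is illustrative, not verbatim README code:

```
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import opennlp.tools.parse_thicket.matching.Matcher;
import opennlp.tools.similarity.apps.utils.Pair;
import opennlp.tools.textsimilarity.ParseTreeChunk;
import opennlp.tools.textsimilarity.ParseTreeChunkListScorer;

public class AnswerReRanker {

  private final ParseTreeChunkListScorer parseTreeChunkListScorer =
      new ParseTreeChunkListScorer();

  public List<Pair<String, Double>> reRank(Matcher m, String question,
                                           List<String> orderedListOfAnswers) {
    List<Pair<String, Double>> pairList = new ArrayList<>();
    for (String ans : orderedListOfAnswers) {
      // Linguistic similarity between the question and one candidate answer
      List<List<ParseTreeChunk>> similarityResult =
          m.assessRelevanceCache(question, ans);
      double score = parseTreeChunkListScorer
          .getParseTreeChunkListScoreAggregPhraseType(similarityResult);
      pairList.add(new Pair<>(ans, score));
    }
    // Same ascending sort as above; reverse it if the best answer should come first
    pairList.sort(Comparator.comparing(Pair::getSecond));
    return pairList;
  }
}
```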
### Content generator
- It takes a topic, builds a taxonomy for it and forms a table of content. It
then mines the web for documents for each table of content item, finds
relevant sentences and paragraphs and merges them into a document
[package](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/similarity/apps).
The resultant document has a TOC, sections, figures & captions and also a
reference section. We attempt to reproduce how humans cut-and-paste content
[...]
- Content generation has a [demo](http://37.46.135.20/) and to run it from
IDE start
[here](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/main/java/opennlp/tools/similarity/apps/ContentGeneratorRunner.java).
Examples of written documents are [here](http://37.46.135.20/wrt_latest/).
- Another content generation option is about opinion data. Reviews are mined
for, cross-bred and made "original" for search engines. This and general
content generation is done for SEO purposes. [Review
builder](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/apps/review_builder/ReviewBuilderRunner.java)
composes fake reviews which are in turn should be recognized by a Fake Review
detector
+It takes a topic, builds a taxonomy for it and forms a table of contents. It
then mines the web for documents for each table-of-contents item, finds
relevant sentences and paragraphs and merges them into a document
[package](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/similarity/apps).
The resultant document has a TOC, sections, figures & captions and also a
reference section. We attempt to reproduce how humans cut-and-paste content
[...]
+Content generation has a [demo](http://37.46.135.20/); to run it from an IDE,
start
[here](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/main/java/opennlp/tools/similarity/apps/ContentGeneratorRunner.java).
Examples of written documents are [here](http://37.46.135.20/wrt_latest/).
+
+Another content generation option concerns opinion data. Reviews are mined,
cross-bred and made "original" for search engines. This and general
content generation is done for SEO purposes. [Review
builder](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/apps/review_builder/ReviewBuilderRunner.java)
composes fake reviews which, in turn, should be recognized by a Fake Review
detector
### Text classifier / feature detector in text
The [classifier
code](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/main/java/opennlp/tools/parse_thicket/kernel_interface/TreeKernelBasedClassifierMultiplePara.java)
is the same but the [model
files](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/test/resources/tree_kernel/TRAINING)
vary for the applications below:
-<li>detect security leaks
-<li>detect argumentation
-<li>detect low cohesiveness in text
-<li>detect authors’ doubt and low confidence
-<li>detect fake review
+- detect security leaks
+- detect argumentation
+- detect low cohesiveness in text
+- detect authors’ doubt and low confidence
+- detect fake reviews
Document classification to six major classes {finance, business, legal,
computing, engineering, health} is available via the [nearest neighbor
model](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/doc_classifier/DocClassifier.java).
A Lucene training model (1G file) is obtained from the Wikipedia corpus. This
classifier can be trained for arbitrary classes once respective Wiki pages
are selected and a respective [Lucene index is built](https: [...]
@@ -135,8 +132,7 @@ Document classification to six major classes {finance,
business, legal, computin
To do model building and predictions, C modules are run in [this
directory](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/test/resources/tree_kernel),
so a proper choice needs to be made: {svm_classify.linux, svm_classify.max,
svm_classify.exe, svm_learn.*}. Also, proper run permissions need to be set
for these files.
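A minimal sketch of such an invocation is shown below, assuming the
conventional SVM-light argument order (`svm_classify <test file> <model file>
<predictions file>`); the data file names and the platform mapping are
hypothetical:

```
import java.io.File;
import java.io.IOException;

public class SvmClassifyRunner {

  public static void main(String[] args) throws IOException, InterruptedException {
    File dir = new File("src/test/resources/tree_kernel");
    String os = System.getProperty("os.name").toLowerCase();
    // Choose the binary for the current platform, following the names listed above
    String binary = os.contains("win") ? "svm_classify.exe"
        : os.contains("mac") ? "svm_classify.max"
        : "svm_classify.linux";
    File exe = new File(dir, binary);
    // Proper run permissions need to be set for these files
    if (!exe.setExecutable(true)) {
      System.err.println("Could not set the executable bit on " + exe);
    }
    // Hypothetical file names; SVM-light convention: <test data> <model> <predictions>
    Process p = new ProcessBuilder(exe.getAbsolutePath(),
        "test_data.txt", "trained_model", "predictions.txt")
        .directory(dir)
        .inheritIO()
        .start();
    System.out.println("svm_classify exited with " + p.waitFor());
  }
}
```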
#### Concept learning
-
- is a branch of deterministic learning which is applied to attribute-value
pairs and possesses useful explainability feature, unlike statistical and deep
learning. It is fairly useful for data exploration and visualization since all
interesting relations can be visualized.
+.. is a branch of deterministic learning which is applied to attribute-value
pairs and, unlike statistical and deep learning, possesses a useful
explainability feature. It is fairly useful for data exploration and
visualization since all interesting relations can be visualized.
Concept learning covers inductive and abductive learning and also some
cases of deduction. Explore [this
package](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/fca)
for the concept learning-related features.
### Filtering results for Speech Recognition based on semantic meaningfulness
diff --git a/opennlp-similarity/pom.xml b/opennlp-similarity/pom.xml
index 58dd8a2..b10aa48 100644
--- a/opennlp-similarity/pom.xml
+++ b/opennlp-similarity/pom.xml
@@ -27,6 +27,12 @@
<name>Apache OpenNLP Similarity distribution</name>
<properties>
+ <jakarta.bind-api.version>4.0.2</jakarta.bind-api.version>
+ <jakarta.mail.version>2.1.3</jakarta.mail.version>
+
+ <tika.version>3.0.0</tika.version>
+ <solr.version>8.11.3</solr.version>
+ <docx4j.version>11.5.1</docx4j.version>
<dl4j.version>1.0.0-M2.1</dl4j.version>
<hdf5.version>1.14.3-1.5.10</hdf5.version>
<javacpp.version>1.5.11</javacpp.version>
@@ -83,27 +89,24 @@
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
</dependency>
-
<dependency>
- <groupId>org.slf4j</groupId>
- <artifactId>slf4j-api</artifactId>
+ <groupId>org.apache.commons</groupId>
+ <artifactId>commons-math3</artifactId>
</dependency>
-
<dependency>
- <groupId>commons-lang</groupId>
- <artifactId>commons-lang</artifactId>
+ <groupId>commons-io</groupId>
+ <artifactId>commons-io</artifactId>
+ <scope>runtime</scope>
</dependency>
<dependency>
- <groupId>commons-codec</groupId>
- <artifactId>commons-codec</artifactId>
+ <groupId>jakarta.xml.bind</groupId>
+ <artifactId>jakarta.xml.bind-api</artifactId>
+ <version>${jakarta.bind-api.version}</version>
</dependency>
<dependency>
- <groupId>commons-collections</groupId>
- <artifactId>commons-collections</artifactId>
- </dependency>
- <dependency>
- <groupId>org.apache.commons</groupId>
- <artifactId>commons-math3</artifactId>
+ <groupId>jakarta.mail</groupId>
+ <artifactId>jakarta.mail-api</artifactId>
+ <version>${jakarta.mail.version}</version>
</dependency>
<dependency>
<groupId>org.json</groupId>
@@ -112,19 +115,20 @@
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
- <artifactId>tika-app</artifactId>
- <version>3.0.0</version>
+ <artifactId>tika-core</artifactId>
+ <version>${tika.version}</version>
</dependency>
<dependency>
- <groupId>net.sf.opencsv</groupId>
- <artifactId>opencsv</artifactId>
- <version>2.3</version>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parser-html-module</artifactId>
+ <version>${tika.version}</version>
+ <scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-core</artifactId>
- <version>8.11.3</version>
+ <version>${solr.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
@@ -138,20 +142,13 @@
<groupId>org.eclipse.jetty.http2</groupId>
<artifactId>*</artifactId>
</exclusion>
+ <exclusion>
+ <groupId>org.apache.logging.log4j</groupId>
+ <artifactId>*</artifactId>
+ </exclusion>
</exclusions>
</dependency>
- <dependency>
- <groupId>javax.mail</groupId>
- <artifactId>mail</artifactId>
- <version>1.4.7</version>
- </dependency>
- <dependency>
- <groupId>com.restfb</groupId>
- <artifactId>restfb</artifactId>
- <version>1.49.0</version>
- </dependency>
-
<dependency>
<groupId>net.billylieurance.azuresearch</groupId>
<artifactId>azure-bing-search-java</artifactId>
@@ -181,8 +178,8 @@
<dependency>
<groupId>org.docx4j</groupId>
- <artifactId>docx4j</artifactId>
- <version>6.1.2</version>
+ <artifactId>docx4j-core</artifactId>
+ <version>${docx4j.version}</version>
<exclusions>
<!-- Exclusion here as log4j version 2 bindings are used during
tests/runtime-->
<exclusion>
@@ -217,11 +214,7 @@
</exclusion>
</exclusions>
</dependency>
- <dependency>
- <groupId>org.deeplearning4j</groupId>
- <artifactId>deeplearning4j-ui</artifactId>
- <version>${dl4j.version}</version>
- </dependency>
+
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-nlp</artifactId>
@@ -252,10 +245,15 @@
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-params</artifactId>
</dependency>
+
+ <!-- Logging -->
<dependency>
- <groupId>org.apache.logging.log4j</groupId>
- <artifactId>log4j-api</artifactId>
- <scope>test</scope>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>log4j-over-slf4j</artifactId>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
@@ -265,7 +263,7 @@
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j2-impl</artifactId>
- <scope>test</scope>
+ <scope>runtime</scope>
</dependency>
</dependencies>
@@ -444,7 +442,7 @@
<configuration>
<source>${maven.compiler.source}</source>
<target>${maven.compiler.target}</target>
- <compilerArgument>-Xlint</compilerArgument>
+ <compilerArgument>-Xlint:-options</compilerArgument>
</configuration>
</plugin>
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/FBOpenGraphSearchManager.java
b/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/FBOpenGraphSearchManager.java
deleted file mode 100644
index f2a130a..0000000
---
a/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/FBOpenGraphSearchManager.java
+++ /dev/null
@@ -1,148 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package opennlp.tools.apps.review_builder;
-
-import java.util.ArrayList;
-import java.util.List;
-
-import com.restfb.Connection;
-import com.restfb.DefaultFacebookClient;
-import com.restfb.FacebookClient;
-import com.restfb.Parameter;
-import com.restfb.exception.FacebookException;
-import com.restfb.types.Event;
-import com.restfb.types.Page;
-import org.apache.commons.lang.StringUtils;
-
-import opennlp.tools.jsmlearning.ProfileReaderWriter;
-import opennlp.tools.similarity.apps.utils.PageFetcher;
-
-public class FBOpenGraphSearchManager {
-
- public final List<String[]> profiles;
- protected FacebookClient mFBClient;
- protected final PageFetcher pageFetcher = new PageFetcher();
- protected static final int NUM_TRIES = 5;
- protected static final long WAIT_BTW_TRIES=1000; //milliseconds between
re-tries
-
- public FBOpenGraphSearchManager(){
- profiles =
ProfileReaderWriter.readProfiles("C:\\nc\\features\\analytics\\dealanalyzer\\sweetjack-localcoupon-may12012tooct302012.csv");
- }
-
- public void setFacebookClient(FacebookClient c){
- this.mFBClient=c;
- }
-
- public List<Event> getFBEventsByName(String event)
- {
- List<Event> events = new ArrayList<>();
-
- for(int i=0; i < NUM_TRIES; i++)
- {
- try
- {
- Connection<Event> publicSearch =
- mFBClient.fetchConnection("search", Event.class,
- Parameter.with("q", event),
Parameter.with("type", "event"),Parameter.with("limit", 100));
- System.out.println("Searching FB events for " + event);
- events= publicSearch.getData();
- break;
- }
- catch(FacebookException e)
- {
- System.out.println("FBError "+e);
- try
- {
- Thread.sleep(WAIT_BTW_TRIES);
- }
- catch (InterruptedException e1)
- {
- System.out.println("Error "+e1);
- }
- }
- }
- return events;
- }
-
- public Long getFBPageLikes(String merchant)
- {
- List<Page> groups = new ArrayList<>();
-
- for(int i=0; i < NUM_TRIES; i++)
- {
- try
- {
- Connection<Page> publicSearch =
- mFBClient.fetchConnection("search", Page.class,
- Parameter.with("q", merchant),
Parameter.with("type", "page"),Parameter.with("limit", 100));
- System.out.println("Searching FB Pages for " + merchant);
- groups= publicSearch.getData();
- break;
- }
- catch(FacebookException e)
- {
- System.out.println("FBError "+e);
- try
- {
- Thread.sleep(WAIT_BTW_TRIES);
- }
- catch (InterruptedException e1)
- {
- System.out.println("Error "+e1);
- }
- }
- }
-
- for (Page p: groups){
- if (p!=null && p.getLikes()!=null && p.getLikes()>0)
- return p.getLikes();
- }
-
- //stats fwb">235</span>
-
- for (Page p: groups){
- if (p.getId()==null)
- continue;
- String content =
pageFetcher.fetchOrigHTML("http://www.facebook.com/"+p.getId());
-
- String likes = StringUtils.substringBetween(content, "stats
fwb\">", "<" );
- if (likes==null)
- continue;
- int nLikes =0;
- try {
- nLikes = Integer.parseInt(likes);
- } catch (Exception e){
-
- }
- if (nLikes>0){
- return (long)nLikes;
- }
-
- }
- return null;
- }
-
- public static void main(String[] args){
- FBOpenGraphSearchManager man = new FBOpenGraphSearchManager ();
- man.setFacebookClient(new DefaultFacebookClient());
-
- long res = man.getFBPageLikes("chain saw");
- System.out.println(res);
-
- }
-}
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/WebPageReviewExtractor.java
b/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/WebPageReviewExtractor.java
index 4448f58..14574f3 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/WebPageReviewExtractor.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/apps/review_builder/WebPageReviewExtractor.java
@@ -28,7 +28,6 @@ import opennlp.tools.similarity.apps.HitBase;
import opennlp.tools.similarity.apps.utils.StringDistanceMeasurer;
import opennlp.tools.similarity.apps.utils.Utils;
import opennlp.tools.textsimilarity.TextProcessor;
-import
opennlp.tools.textsimilarity.chunker2matcher.ParserChunker2MatcherProcessor;
import org.apache.commons.lang.StringUtils;
import org.slf4j.Logger;
@@ -392,7 +391,6 @@ public class WebPageReviewExtractor extends
WebPageExtractor {
public static void main(String[] args){
String resourceDir = "C:/stanford-corenlp/src/test/resources/";
- ParserChunker2MatcherProcessor proc =
ParserChunker2MatcherProcessor.getInstance(resourceDir);
//ProductFinderInAWebPage init = new
ProductFinderInAWebPage("C:/workspace/relevanceEngine/src/test/resources");
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/EmailSender.java
b/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/EmailSender.java
index c5388fa..94ba811 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/EmailSender.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/EmailSender.java
@@ -17,19 +17,19 @@
package opennlp.tools.apps.utils.email;
-import javax.activation.DataHandler;
-import javax.activation.DataSource;
-import javax.activation.FileDataSource;
-import javax.mail.Authenticator;
-import javax.mail.BodyPart;
-import javax.mail.Message;
-import javax.mail.Multipart;
-import javax.mail.Session;
-import javax.mail.Transport;
-import javax.mail.internet.InternetAddress;
-import javax.mail.internet.MimeBodyPart;
-import javax.mail.internet.MimeMessage;
-import javax.mail.internet.MimeMultipart;
+import jakarta.activation.DataHandler;
+import jakarta.activation.DataSource;
+import jakarta.activation.FileDataSource;
+import jakarta.mail.Authenticator;
+import jakarta.mail.BodyPart;
+import jakarta.mail.Message;
+import jakarta.mail.Multipart;
+import jakarta.mail.Session;
+import jakarta.mail.Transport;
+import jakarta.mail.internet.InternetAddress;
+import jakarta.mail.internet.MimeBodyPart;
+import jakarta.mail.internet.MimeMessage;
+import jakarta.mail.internet.MimeMultipart;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/SMTPAuthenticator.java
b/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/SMTPAuthenticator.java
index c48ab34..55f56dd 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/SMTPAuthenticator.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/apps/utils/email/SMTPAuthenticator.java
@@ -17,12 +17,12 @@
package opennlp.tools.apps.utils.email;
-import javax.mail.PasswordAuthentication;
+import jakarta.mail.PasswordAuthentication;
/**
* This contains the required information for the smtp authorization!
*/
-public class SMTPAuthenticator extends javax.mail.Authenticator {
+public class SMTPAuthenticator extends jakarta.mail.Authenticator {
private final String username;
private final String password;
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifier.java
b/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifier.java
index 41bec16..784ebb2 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifier.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifier.java
@@ -24,10 +24,6 @@ import java.util.List;
import java.util.Map;
import java.util.Scanner;
-import opennlp.tools.similarity.apps.utils.CountItemsList;
-import opennlp.tools.similarity.apps.utils.ValueSortMap;
-import opennlp.tools.textsimilarity.TextProcessor;
-
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
@@ -45,6 +41,10 @@ import org.apache.lucene.store.FSDirectory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+import opennlp.tools.similarity.apps.utils.CountItemsList;
+import opennlp.tools.similarity.apps.utils.ValueSortMap;
+import opennlp.tools.textsimilarity.TextProcessor;
+
public class DocClassifier {
private static final Logger LOGGER =
LoggerFactory.getLogger(DocClassifier.class);
@@ -60,8 +60,8 @@ public class DocClassifier {
// http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
private static final int MAX_DOCS_TO_USE_FOR_CLASSIFY = 10, // 10
similar
- // docs for nearest neighbor settings
- MAX_CATEG_RESULTS = 2;
+ // docs for nearest neighbor settings
+ MAX_CATEG_RESULTS = 2;
private static final float BEST_TO_NEX_BEST_RATIO = 2.0f;
// to accumulate classif results
private final CountItemsList<String> localCats = new CountItemsList<>();
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetMultilingualExtender.java
b/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetMultilingualExtender.java
index 29a5107..18d778c 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetMultilingualExtender.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetMultilingualExtender.java
@@ -27,11 +27,11 @@ import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
-import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
/*
@@ -86,7 +86,7 @@ public class DocClassifierTrainingSetMultilingualExtender {
List<String> filteredEntries = new ArrayList<>();
String content=null;
try {
- content = FileUtils.readFileToString(new
File(filename), StandardCharsets.UTF_8);
+ content = Files.readString(new File(filename).toPath(),
StandardCharsets.UTF_8);
} catch (IOException e) {
e.printStackTrace();
}
@@ -127,7 +127,7 @@ public class DocClassifierTrainingSetMultilingualExtender {
continue;
System.out.println("processing "+f.getName());
- content = FileUtils.readFileToString(f,
"utf-8");
+ content = Files.readString(f.toPath(),
StandardCharsets.UTF_8);
int langIndex =0;
for(String[] begEnd: MULTILINGUAL_TOKENS){
String urlDirty =
StringUtils.substringBetween(content, begEnd[0], begEnd[1]);
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetVerifier.java
b/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetVerifier.java
index 95c2b27..d774c4d 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetVerifier.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/doc_classifier/DocClassifierTrainingSetVerifier.java
@@ -18,12 +18,12 @@ package opennlp.tools.doc_classifier;
import java.io.File;
import java.io.IOException;
+import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import opennlp.tools.jsmlearning.ProfileReaderWriter;
-import org.apache.commons.io.FileUtils;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
@@ -96,7 +96,7 @@ public class DocClassifierTrainingSetVerifier {
&& resultsClassif.get(0).equals(
ClassifierTrainingSetIndexer.getCategoryFromFilePath(f.getAbsolutePath()))){
String destFileName =
f.getAbsolutePath().replace(sourceDir, destinationDir);
- FileUtils.copyFile(f, new
File(destFileName));
+ Files.copy(f.toPath(), new
File(destFileName).toPath());
bRejected = false;
} else {
System.out.println("File "+
f.getAbsolutePath() + "\n classified as "+
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailNormalizer.java
b/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailNormalizer.java
index 6e1ebe9..3fde124 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailNormalizer.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailNormalizer.java
@@ -20,10 +20,9 @@ package opennlp.tools.enron_email_recognizer;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
import java.util.ArrayList;
-import org.apache.commons.io.FileUtils;
-
public class EmailNormalizer {
protected final ArrayList<File> queue = new ArrayList<>();
@@ -67,7 +66,7 @@ public class EmailNormalizer {
public void normalizeAndWriteIntoANewFile(File f){
String content = "";
try {
- content = FileUtils.readFileToString(f,
StandardCharsets.UTF_8);
+ content = Files.readString(f.toPath(),
StandardCharsets.UTF_8);
} catch (IOException e) {
e.printStackTrace();
}
@@ -95,10 +94,10 @@ public class EmailNormalizer {
String directoryNew = f.getAbsolutePath().replace(origFolder,
newFolder);
try {
String fullFileNameNew = directoryNew +"txt";
- FileUtils.writeStringToFile(new File(fullFileNameNew),
buf.toString(), StandardCharsets.UTF_8);
- } catch (IOException e) {
- e.printStackTrace();
- }
+ Files.writeString(new File(fullFileNameNew).toPath(),
buf.toString(), StandardCharsets.UTF_8);
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
}
public void normalizeDirectory(File f){
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailTrainingSetFormer.java
b/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailTrainingSetFormer.java
index 1a8ce6d..2551052 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailTrainingSetFormer.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/enron_email_recognizer/EmailTrainingSetFormer.java
@@ -20,10 +20,9 @@ package opennlp.tools.enron_email_recognizer;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
import java.util.List;
-import org.apache.commons.io.FileUtils;
-
public class EmailTrainingSetFormer {
static final String DATA_DIR = "/Users/bgalitsky/Downloads/";
static final String FILE_LIST_FILE = "cats4_11-17.txt";
@@ -32,14 +31,14 @@ public class EmailTrainingSetFormer {
//enron_with_categories/5/70665.cats:4,10,1
public static void createPosTrainingSet(){
try {
- List<String> lines = FileUtils.readLines(new
File(DATA_DIR + FILE_LIST_FILE), StandardCharsets.UTF_8);
+ List<String> lines = Files.readAllLines(new
File(DATA_DIR + FILE_LIST_FILE).toPath(), StandardCharsets.UTF_8);
for(String l: lines){
int endOfFname = l.indexOf('.'), startOfFname =
l.lastIndexOf('/');
String filenameOld = DATA_DIR + l.substring(0,
endOfFname)+".txt";
String content = normalize(new
File(filenameOld));
String filenameNew = DESTINATION_DIR +
l.substring(startOfFname+1, endOfFname)+".txt";
//FileUtils.copyFile(new File(filenameOld), new
File(filenameNew));
- FileUtils.writeStringToFile(new
File(filenameNew), content, StandardCharsets.UTF_8);
+ Files.writeString(new
File(filenameNew).toPath(), content, StandardCharsets.UTF_8);
}
} catch (Exception e) {
e.printStackTrace();
@@ -52,7 +51,7 @@ public class EmailTrainingSetFormer {
public static String normalize(File f){
String content="";
try {
- content = FileUtils.readFileToString(f,
StandardCharsets.UTF_8);
+ content = Files.readString(f.toPath(),
StandardCharsets.UTF_8);
} catch (IOException e) {
e.printStackTrace();
}
diff --git a/opennlp-similarity/src/main/java/opennlp/tools/nl2code/NL2Obj.java
b/opennlp-similarity/src/main/java/opennlp/tools/nl2code/NL2Obj.java
index e4beac6..3d8929f 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/nl2code/NL2Obj.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/nl2code/NL2Obj.java
@@ -30,18 +30,15 @@ public class NL2Obj {
ObjectControlOp prevOp;
public NL2Obj(String path) {
+ this();
+ }
+
+ public NL2Obj() {
prevOp = new ObjectControlOp();
prevOp.setOperatorIf("");
prevOp.setOperatorFor("");
- parser = ParserChunker2MatcherProcessor.getInstance(path);
+ parser = ParserChunker2MatcherProcessor.getInstance();
}
-
- public NL2Obj() {
- prevOp = new ObjectControlOp();
- prevOp.setOperatorIf("");
- prevOp.setOperatorFor("");
- parser = ParserChunker2MatcherProcessor.getInstance();
- }
static final String[] EPISTEMIC_STATES_LIST = new String[] {
"select", "verify", "find", "start", "stop", "go", "check"
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/parse_thicket/apps/MostFrequentWordsFromPageGetter.java
b/opennlp-similarity/src/main/java/opennlp/tools/parse_thicket/apps/MostFrequentWordsFromPageGetter.java
index 0e937f5..54c0b8b 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/parse_thicket/apps/MostFrequentWordsFromPageGetter.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/parse_thicket/apps/MostFrequentWordsFromPageGetter.java
@@ -31,23 +31,22 @@ public class MostFrequentWordsFromPageGetter {
public List<String> getMostFrequentWordsInText(String input) {
int maxRes = 4;
Scanner in = new Scanner(input);
- in.useDelimiter("\\s+");
- Map<String, Integer> words = new HashMap<>();
-
- while (in.hasNext()) {
- String word = in.next();
- if (!StringUtils.isAlpha(word) || word.length()<4 )
- continue;
-
- if (!words.containsKey(word)) {
- words.put(word, 1);
- } else {
- words.put(word, words.get(word) + 1);
- }
+ in.useDelimiter("\\s+");
+ Map<String, Integer> words = new HashMap<>();
+
+ while (in.hasNext()) {
+ String word = in.next();
+ if (!StringUtils.isAlpha(word) || word.length()<4 )
+ continue;
+
+ if (!words.containsKey(word)) {
+ words.put(word, 1);
+ } else {
+ words.put(word, words.get(word) + 1);
}
-
- words = ValueSortMap.sortMapByValue(words, false);
- List<String> results = new ArrayList<>(words.keySet());
+ }
+ words = ValueSortMap.sortMapByValue(words, false);
+ List<String> results = new ArrayList<>(words.keySet());
if (results.size() > maxRes )
results = results.subList(0, maxRes); // get maxRes
elements
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/ContentGeneratorRunner.java
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/ContentGeneratorRunner.java
index b6bc2b1..0bf2e59 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/ContentGeneratorRunner.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/ContentGeneratorRunner.java
@@ -18,26 +18,13 @@ package opennlp.tools.similarity.apps;
import java.util.List;
-import javax.mail.internet.AddressException;
-import javax.mail.internet.InternetAddress;
-
-import
opennlp.tools.textsimilarity.chunker2matcher.ParserChunker2MatcherProcessor;
+import jakarta.mail.internet.AddressException;
+import jakarta.mail.internet.InternetAddress;
public class ContentGeneratorRunner {
+
public static void main(String[] args) {
- ParserChunker2MatcherProcessor sm = null;
-
- try {
- String resourceDir = args[2];
- if (resourceDir!=null)
- sm =
ParserChunker2MatcherProcessor.getInstance(resourceDir);
- else
- sm =
ParserChunker2MatcherProcessor.getInstance();
-
- } catch (Exception e) {
- e.printStackTrace();
- }
-
+
String bingKey = args[7];
if (bingKey == null){
bingKey =
"e8ADxIjn9YyHx36EihdjH/tMqJJItUrrbPTUpKahiU0=";
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/CommentsRel.java
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/CommentsRel.java
index e80e94e..85c4714 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/CommentsRel.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/CommentsRel.java
@@ -23,7 +23,7 @@ import java.io.File;
import java.io.IOException;
import java.math.BigInteger;
-import javax.xml.bind.JAXBException;
+import jakarta.xml.bind.JAXBException;
import org.docx4j.XmlUtils;
import org.docx4j.jaxb.Context;
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/ContentGeneratorRequestHandler.java
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/ContentGeneratorRequestHandler.java
index a40c0bb..5403ab5 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/ContentGeneratorRequestHandler.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/ContentGeneratorRequestHandler.java
@@ -16,31 +16,29 @@
*/
package opennlp.tools.similarity.apps.solr;
-import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
-import java.io.InputStream;
-import java.io.InputStreamReader;
+import java.lang.invoke.MethodHandles;
import java.util.List;
-import java.util.logging.Logger;
-import javax.mail.internet.AddressException;
-import javax.mail.internet.InternetAddress;
+import jakarta.mail.internet.AddressException;
+import jakarta.mail.internet.InternetAddress;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.SearchHandler;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
import opennlp.tools.similarity.apps.HitBase;
import opennlp.tools.similarity.apps.RelatedSentenceFinder;
import opennlp.tools.similarity.apps.RelatedSentenceFinderML;
-import
opennlp.tools.textsimilarity.chunker2matcher.ParserChunker2MatcherProcessor;
public class ContentGeneratorRequestHandler extends SearchHandler {
- private static final Logger LOG =
-
Logger.getLogger("com.become.search.requestHandlers.SearchResultsReRankerRequestHandler");
+ private static final Logger LOG =
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
private final WordDocBuilderEndNotes docBuilder = new
WordDocBuilderEndNotes ();
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse
rsp){
@@ -97,44 +95,13 @@ public class ContentGeneratorRequestHandler extends
SearchHandler {
}
- static class StreamLogger extends Thread{
-
- private final InputStream mInputStream;
-
- public StreamLogger(InputStream is) {
- this.mInputStream = is;
- }
-
- public void run() {
- try {
- InputStreamReader isr = new
InputStreamReader(mInputStream);
- BufferedReader br = new BufferedReader(isr);
- String line;
- while ((line = br.readLine()) != null) {
- System.out.println(line);
- }
- } catch (IOException ioe) {
- ioe.printStackTrace();
- }
- }
- }
-
public String cgRunner(String[] args) {
- int count=0;
+
+ int count=0;
for(String a: args){
System.out.print(count+">>" + a + " | ");
count++;
}
- try {
- String resourceDir = args[2];
- ParserChunker2MatcherProcessor sm = null;
- if (resourceDir!=null)
- sm =
ParserChunker2MatcherProcessor.getInstance(resourceDir);
- else
- sm =
ParserChunker2MatcherProcessor.getInstance();
- } catch (Exception e) {
- e.printStackTrace();
- }
String bingKey = args[7];
if (bingKey == null){
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/SearchResultsReRankerRequestHandler.java
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/SearchResultsReRankerRequestHandler.java
index 3e77f43..c7345fc 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/SearchResultsReRankerRequestHandler.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/SearchResultsReRankerRequestHandler.java
@@ -16,11 +16,11 @@
*/
package opennlp.tools.similarity.apps.solr;
+import java.lang.invoke.MethodHandles;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
-import java.util.logging.Logger;
import opennlp.tools.similarity.apps.HitBase;
import opennlp.tools.textsimilarity.ParseTreeChunk;
@@ -34,16 +34,16 @@ import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.SearchHandler;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
public class SearchResultsReRankerRequestHandler extends SearchHandler {
- private static final Logger LOG =
-
Logger.getLogger("com.become.search.requestHandlers.SearchResultsReRankerRequestHandler");
+
+ private static final Logger LOG =
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
private final static int MAX_SEARCH_RESULTS = 100;
private final ParseTreeChunkListScorer parseTreeChunkListScorer = new
ParseTreeChunkListScorer();
private ParserChunker2MatcherProcessor sm = null;
- private static final String RESOURCE_DIR =
"/home/solr/solr-4.4.0/example/src/test/resources";
- //"C:/workspace/TestSolr/src/test/resources";
- //"/data1/solr/example/src/test/resources";
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse
rsp){
// get query string
@@ -66,10 +66,6 @@ public class SearchResultsReRankerRequestHandler extends
SearchHandler {
List<HitBase> searchResults = new ArrayList<>();
-
-
-
-
for (int i = 0; i< MAX_SEARCH_RESULTS; i++){
String title = req.getParams().get("t"+i);
String descr = req.getParams().get("d"+i);
@@ -106,7 +102,6 @@ public class SearchResultsReRankerRequestHandler extends
SearchHandler {
}
}
-
List<HitBase> reRankedResults;
query = query.replace('+', ' ');
if (tooFewKeywords(query)|| orQuery(query)){
@@ -165,12 +160,11 @@ public class SearchResultsReRankerRequestHandler extends
SearchHandler {
return false;
}
- private List<HitBase> calculateMatchScoreResortHits(List<HitBase> hits,
- String searchQuery) {
+ private List<HitBase> calculateMatchScoreResortHits(List<HitBase> hits,
String searchQuery) {
try {
- sm =
ParserChunker2MatcherProcessor.getInstance(RESOURCE_DIR);
- } catch (Exception e){
- LOG.severe(e.getMessage());
+ sm = ParserChunker2MatcherProcessor.getInstance();
+ } catch (RuntimeException e){
+ LOG.error(e.getMessage(), e);
}
List<HitBase> newHitList = new ArrayList<>();
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/WordDocBuilderEndNotes.java
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/WordDocBuilderEndNotes.java
index afe37fc..dcda0ce 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/WordDocBuilderEndNotes.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/solr/WordDocBuilderEndNotes.java
@@ -16,15 +16,11 @@
*/
package opennlp.tools.similarity.apps.solr;
-
import java.io.File;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
-import javax.xml.bind.JAXBException;
-
-import org.apache.commons.lang.StringUtils;
import org.docx4j.XmlUtils;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.exceptions.InvalidFormatException;
@@ -69,7 +65,7 @@ public class WordDocBuilderEndNotes extends
WordDocBuilderSingleImageSearchCall{
String processedParaTitle =
processParagraphTitle(para.getTitle());
if (processedParaTitle!=null &&
-
!processedParaTitle.endsWith("..") ||
StringUtils.isAlphanumeric(processedParaTitle)){
+
!processedParaTitle.endsWith("..") ||
processedParaTitle.chars().allMatch(this::isAlphanumeric)){
wordMLPackage.getMainDocumentPart().addStyledParagraphOfText("Subtitle",processedParaTitle);
}
String paraText =
processParagraphText(para.getFragments().toString());
@@ -85,7 +81,7 @@ public class WordDocBuilderEndNotes extends
WordDocBuilderSingleImageSearchCall{
"<w:rStyle
w:val=\"EndnoteReference\"/></w:rPr><w:endnoteRef/></w:r><w:r><w:t
xml:space=\"preserve\"> "+ url + "</w:t></w:r></w:p>";
try {
endnote.getEGBlockLevelElts().add( XmlUtils.unmarshalString(endnoteBody));
- } catch (JAXBException e) {
+ } catch (Exception e) {
e.printStackTrace();
}
@@ -95,7 +91,7 @@ public class WordDocBuilderEndNotes extends
WordDocBuilderSingleImageSearchCall{
try {
wordMLPackage.getMainDocumentPart().addParagraph(docBody);
- } catch (JAXBException e) {
+ } catch (Exception e) {
e.printStackTrace();
}
@@ -172,20 +168,25 @@ public class WordDocBuilderEndNotes extends
WordDocBuilderSingleImageSearchCall{
return bestPart;
}
+ private boolean isAlphanumeric(final int codePoint) {
+ return (codePoint >= 65 && codePoint <= 90) ||
+ (codePoint >= 97 && codePoint
<= 122) ||
+ (codePoint >= 48 && codePoint
<= 57);
+ }
- public static void main(String[] args){
- WordDocBuilderEndNotes b = new WordDocBuilderEndNotes();
- List<HitBase> content = new ArrayList<>();
- for(int i = 0; i<10; i++){
- HitBase h = new HitBase();
- h.setTitle("albert einstein "+i);
- List<Fragment> frs = new ArrayList<>();
- frs.add(new Fragment(" content "+i, 0));
- h.setFragments(frs);
- h.setUrl("http://www."+i+".com");
- content.add(h);
- }
-
- b.buildWordDoc(content, "albert einstein");
- }
+ public static void main(String[] args){
+ WordDocBuilderEndNotes b = new WordDocBuilderEndNotes();
+ List<HitBase> content = new ArrayList<>();
+ for(int i = 0; i<10; i++){
+ HitBase h = new HitBase();
+ h.setTitle("albert einstein "+i);
+ List<Fragment> frs = new ArrayList<>();
+ frs.add(new Fragment(" content "+i, 0));
+ h.setFragments(frs);
+ h.setUrl("http://www."+i+".com");
+ content.add(h);
+ }
+
+ b.buildWordDoc(content, "albert einstein");
+ }
}
diff --git
a/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserChunker2MatcherProcessor.java
b/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserChunker2MatcherProcessor.java
index 22dc78b..97eda63 100644
---
a/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserChunker2MatcherProcessor.java
+++
b/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserChunker2MatcherProcessor.java
@@ -18,11 +18,7 @@
package opennlp.tools.textsimilarity.chunker2matcher;
-import java.io.BufferedInputStream;
-import java.io.File;
-import java.io.FileInputStream;
import java.io.IOException;
-import java.io.InputStream;
import java.lang.invoke.MethodHandles;
import java.util.ArrayList;
import java.util.HashMap;
@@ -39,18 +35,19 @@ import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTagger;
-import opennlp.tools.postag.POSTaggerME;
+import opennlp.tools.postag.ThreadSafePOSTaggerME;
import opennlp.tools.sentdetect.SentenceDetector;
-import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
+import opennlp.tools.sentdetect.ThreadSafeSentenceDetectorME;
import opennlp.tools.textsimilarity.LemmaPair;
import opennlp.tools.textsimilarity.ParseTreeChunk;
import opennlp.tools.textsimilarity.ParseTreeMatcherDeterministic;
import opennlp.tools.textsimilarity.SentencePairMatchResult;
import opennlp.tools.textsimilarity.TextProcessor;
+import opennlp.tools.tokenize.ThreadSafeTokenizerME;
import opennlp.tools.tokenize.Tokenizer;
-import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
+import opennlp.tools.util.DownloadUtil;
import opennlp.tools.util.Span;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -60,11 +57,6 @@ public class ParserChunker2MatcherProcessor {
private static final Logger LOG =
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
static final int MIN_SENTENCE_LENGTH = 10;
- private static final String MODEL_DIR_KEY = "nlp.models.dir";
- // TODO config
- // this is where resources should live
- private static String MODEL_DIR=null;
- private static final String MODEL_DIR_REL = "src/test/resources/models";
protected static ParserChunker2MatcherProcessor instance;
private SentenceDetector sentenceDetector;
@@ -75,30 +67,6 @@ public class ParserChunker2MatcherProcessor {
private static final int NUMBER_OF_SECTIONS_IN_SENTENCE_CHUNKS = 5;
private Map<String, String[][]> sentence_parseObject;
- public SentenceDetector getSentenceDetector() {
- return sentenceDetector;
- }
-
- public void setSentenceDetector(SentenceDetector sentenceDetector) {
- this.sentenceDetector = sentenceDetector;
- }
-
- public Tokenizer getTokenizer() {
- return tokenizer;
- }
-
- public void setTokenizer(Tokenizer tokenizer) {
- this.tokenizer = tokenizer;
- }
-
- public ChunkerME getChunker() {
- return chunker;
- }
-
- public void setChunker(ChunkerME chunker) {
- this.chunker = chunker;
- }
-
@SuppressWarnings("unchecked")
protected ParserChunker2MatcherProcessor() {
try {
@@ -108,29 +76,65 @@ public class ParserChunker2MatcherProcessor {
LOG.warn("parsing cache file does not exist (but should be created)");
sentence_parseObject = new HashMap<>();
}
- if (sentence_parseObject == null)
- sentence_parseObject = new HashMap<>();
try {
- if (MODEL_DIR==null || MODEL_DIR.equals("/models")) {
- String absPath = new File(".").getAbsolutePath();
- absPath = absPath.substring(0, absPath.length()-1);
- MODEL_DIR = absPath + MODEL_DIR_REL;
- }
- //get full path from constructor
-
initializeSentenceDetector();
initializeTokenizer();
initializePosTagger();
initializeParser();
initializeChunker();
- } catch (Exception e) { // a typical error when 'model' is not installed
- LOG.warn("The model can't be read and we rely on cache");
- LOG.warn("Please put OpenNLP model files in 'src/test/resources' (folder
'model')");
+ } catch (IOException e) {
+ LOG.warn("A model can't be loaded: {}", e.getMessage());
}
}
- // closing the processor, clearing loaded ling models and serializing parsing cache
+ protected void initializeSentenceDetector() throws IOException {
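+ // Note: DownloadUtil fetches the model on first use and caches it locally (typically under ~/.opennlp), so later runs can load it without network access.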
+ SentenceModel model = DownloadUtil.downloadModel(
+     "en", DownloadUtil.ModelType.SENTENCE_DETECTOR, SentenceModel.class);
+ sentenceDetector = new ThreadSafeSentenceDetectorME(model);
+ }
+
+ protected void initializeTokenizer() throws IOException {
+ TokenizerModel model = DownloadUtil.downloadModel(
+ "en", DownloadUtil.ModelType.TOKENIZER, TokenizerModel.class);
+ tokenizer = new ThreadSafeTokenizerME(model);
+ }
+
+ protected void initializePosTagger() throws IOException {
+ POSModel model = DownloadUtil.downloadModel(
+ "en", DownloadUtil.ModelType.POS, POSModel.class);
+ posTagger = new ThreadSafePOSTaggerME(model);
+ }
+
+ protected void initializeParser() throws IOException {
+ ParserModel model = DownloadUtil.downloadModel(
+ "en", DownloadUtil.ModelType.PARSER, ParserModel.class);
+ parser = ParserFactory.create(model);
+ }
+
+ private void initializeChunker() throws IOException {
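+ // Note: plain ChunkerME is not thread-safe (no ThreadSafe wrapper is used here); access is confined to the processor's synchronized methods.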
+ ChunkerModel model = DownloadUtil.downloadModel(
+ "en", DownloadUtil.ModelType.CHUNKER, ChunkerModel.class);
+ chunker = new ChunkerME(model);
+ }
+
+ public SentenceDetector getSentenceDetector() {
+ return sentenceDetector;
+ }
+
+ public Tokenizer getTokenizer() {
+ return tokenizer;
+ }
+
+ public POSTagger getPOSTagger() {
+ return posTagger;
+ }
+
+ public ChunkerME getChunker() {
+ return chunker;
+ }
+
+ // closing the processor and serializing parsing cache
public void close() {
instance = null;
ParserCacheSerializer.writeObject(sentence_parseObject);
@@ -147,14 +151,6 @@ public class ParserChunker2MatcherProcessor {
return instance;
}
-
- public synchronized static ParserChunker2MatcherProcessor getInstance(String fullPathToResources) {
- MODEL_DIR = fullPathToResources+"/models";
- if (instance == null)
- instance = new ParserChunker2MatcherProcessor();
-
- return instance;
- }
/**
* General parsing function, which returns lists of parses for a portion of
@@ -165,7 +161,7 @@ public class ParserChunker2MatcherProcessor {
* @return lists of parses
*/
public List<List<Parse>> parseTextNlp(String text) {
- if (text == null || text.trim().length() == 0)
+ if (text == null || text.trim().isEmpty())
return null;
List<List<Parse>> textParses = new ArrayList<>(1);
@@ -173,7 +169,7 @@ public class ParserChunker2MatcherProcessor {
// parse paragraph by paragraph
String[] paragraphList = splitParagraph(text);
for (String paragraph : paragraphList) {
- if (paragraph.length() == 0)
+ if (paragraph.isEmpty())
continue;
List<Parse> paragraphParses = parseParagraphNlp(paragraph);
@@ -185,7 +181,7 @@ public class ParserChunker2MatcherProcessor {
}
public List<Parse> parseParagraphNlp(String paragraph) {
- if (paragraph == null || paragraph.trim().length() == 0)
+ if (paragraph == null || paragraph.trim().isEmpty())
return null;
// normalize the text before parsing, otherwise, the sentences may not
@@ -197,7 +193,7 @@ public class ParserChunker2MatcherProcessor {
List<Parse> parseList = new ArrayList<>(sentences.length);
for (String sentence : sentences) {
sentence = sentence.trim();
- if (sentence.length() == 0)
+ if (sentence.isEmpty())
continue;
Parse sentenceParse = parseSentenceNlp(sentence, false);
@@ -250,9 +246,8 @@ public class ParserChunker2MatcherProcessor {
List<List<ParseTreeChunk>> singleSentChunks =
formGroupedPhrasesFromChunksForSentence(sent);
if (singleSentChunks == null)
continue;
- if (listOfChunksAccum.size() < 1) {
- listOfChunksAccum = new ArrayList<>(
- singleSentChunks);
+ if (listOfChunksAccum.isEmpty()) {
+ listOfChunksAccum = new ArrayList<>(singleSentChunks);
} else
for (int i = 0; i < NUMBER_OF_SECTIONS_IN_SENTENCE_CHUNKS; i++) {
// make sure not null
@@ -468,7 +463,7 @@ public class ParserChunker2MatcherProcessor {
public static List<List<SentenceNode>> textToSentenceNodes(
List<List<Parse>> textParses) {
- if (textParses == null || textParses.size() == 0)
+ if (textParses == null || textParses.isEmpty())
return null;
List<List<SentenceNode>> textNodes = new ArrayList<>(
@@ -477,18 +472,18 @@ public class ParserChunker2MatcherProcessor {
List<SentenceNode> paragraphNodes =
paragraphToSentenceNodes(paragraphParses);
// append paragraph node if any
- if (paragraphNodes != null && paragraphNodes.size() > 0)
+ if (paragraphNodes != null && !paragraphNodes.isEmpty())
textNodes.add(paragraphNodes);
}
- if (textNodes.size() > 0)
+ if (!textNodes.isEmpty())
return textNodes;
else
return null;
}
public static List<SentenceNode> paragraphToSentenceNodes(List<Parse> paragraphParses) {
- if (paragraphParses == null || paragraphParses.size() == 0)
+ if (paragraphParses == null || paragraphParses.isEmpty())
return null;
List<SentenceNode> paragraphNodes = new ArrayList<>(paragraphParses.size());
@@ -506,7 +501,7 @@ public class ParserChunker2MatcherProcessor {
paragraphNodes.add(sentenceNode);
}
- if (paragraphNodes.size() > 0)
+ if (!paragraphNodes.isEmpty())
return paragraphNodes;
else
return null;
@@ -518,10 +513,10 @@ public class ParserChunker2MatcherProcessor {
// convert the OpenNLP Parse to our own tree nodes
SyntacticTreeNode node = toSyntacticTreeNode(sentenceParse);
- if ((node == null))
+ if (node == null)
return null;
- if (node instanceof SentenceNode)
- return (SentenceNode) node;
+ if (node instanceof SentenceNode sn)
+ return sn;
else if (node instanceof PhraseNode) {
return new SentenceNode("sentence", node.getChildren());
} else
@@ -575,56 +570,6 @@ public class ParserChunker2MatcherProcessor {
return tokenizer.tokenize(sentence);
}
- protected void initializeSentenceDetector() {
- try (InputStream is = new BufferedInputStream(new FileInputStream(MODEL_DIR + "/en-sent.bin"))) {
- SentenceModel model = new SentenceModel(is);
- sentenceDetector = new SentenceDetectorME(model);
- } catch (IOException e) {
- // we swallow exception to support the cached run
- LOG.debug(e.getLocalizedMessage(), e);
- }
- }
-
- protected void initializeTokenizer() {
- try (InputStream is = new BufferedInputStream(new FileInputStream(MODEL_DIR + "/en-token.bin"))) {
- TokenizerModel model = new TokenizerModel(is);
- tokenizer = new TokenizerME(model);
- } catch (IOException e) {
- // we swallow exception to support the cached run
- LOG.debug(e.getLocalizedMessage(), e);
- }
- }
-
- protected void initializePosTagger() {
- try (InputStream is = new BufferedInputStream(new FileInputStream(MODEL_DIR + "/en-pos-maxent.bin"))) {
- POSModel model = new POSModel(is);
- posTagger = new POSTaggerME(model);
- } catch (IOException e) {
- // we swallow exception to support the cached run
- LOG.debug(e.getLocalizedMessage(), e);
- }
- }
-
- protected void initializeParser() {
- try (InputStream is = new BufferedInputStream(new FileInputStream(MODEL_DIR + "/en-parser-chunking.bin"))) {
- ParserModel model = new ParserModel(is);
- parser = ParserFactory.create(model);
- } catch (IOException e) {
- // we swallow exception to support the cached run
- LOG.debug(e.getLocalizedMessage(), e);
- }
- }
-
- private void initializeChunker() {
- try (InputStream is = new BufferedInputStream(new FileInputStream(MODEL_DIR + "/en-chunker.bin"))) {
- ChunkerModel model = new ChunkerModel(is);
- chunker = new ChunkerME(model);
- } catch (IOException e) {
- // we swallow exception to support the cached run
- LOG.debug(e.getLocalizedMessage(), e);
- }
- }
-
/**
* convert an instance of Parse to SyntacticTreeNode, by filtering out the
* unnecessary data and assigning the word for each node
@@ -641,11 +586,11 @@ public class ParserChunker2MatcherProcessor {
return null;
String text = parse.getText();
- ArrayList<SyntacticTreeNode> childrenNodeList = convertChildrenNodes(parse);
+ List<SyntacticTreeNode> childrenNodeList = convertChildrenNodes(parse);
// check sentence node, the node contained in the top node
if (type.equals(AbstractBottomUpParser.TOP_NODE)
- && childrenNodeList != null && childrenNodeList.size() > 0) {
+ && childrenNodeList != null && !childrenNodeList.isEmpty()) {
PhraseNode rootNode;
try {
rootNode = (PhraseNode) childrenNodeList.get(0);
@@ -656,7 +601,7 @@ public class ParserChunker2MatcherProcessor {
}
// if this node contains children nodes, then it is a phrase node
- if (childrenNodeList != null && childrenNodeList.size() > 0) {
+ if (childrenNodeList != null && !childrenNodeList.isEmpty()) {
// System.out.println("Found "+ type + " phrase = "+ childrenNodeList);
return new PhraseNode(type, childrenNodeList);
@@ -669,7 +614,7 @@ public class ParserChunker2MatcherProcessor {
return new WordNode(type, word);
}
- private static ArrayList<SyntacticTreeNode> convertChildrenNodes(Parse parse) {
+ private static List<SyntacticTreeNode> convertChildrenNodes(Parse parse) {
if (parse == null)
return null;
@@ -677,7 +622,7 @@ public class ParserChunker2MatcherProcessor {
if (children == null || children.length == 0)
return null;
- ArrayList<SyntacticTreeNode> childrenNodeList = new ArrayList<>();
+ List<SyntacticTreeNode> childrenNodeList = new ArrayList<>();
for (Parse child : children) {
SyntacticTreeNode childNode = toSyntacticTreeNode(child);
if (childNode != null)
@@ -711,7 +656,7 @@ public class ParserChunker2MatcherProcessor {
protected List<LemmaPair> listListParseTreeChunk2ListLemmaPairs(
List<List<ParseTreeChunk>> sent1GrpLst) {
List<LemmaPair> results = new ArrayList<>();
- if (sent1GrpLst == null || sent1GrpLst.size() < 1)
+ if (sent1GrpLst == null || sent1GrpLst.isEmpty())
return results;
List<ParseTreeChunk> wholeSentence = sent1GrpLst
.get(sent1GrpLst.size() - 1); // whole sentence is last list in the list
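Note: with the path-based getInstance(String) overload removed above, callers
now obtain the processor via the no-arg singleton, and models are downloaded
on demand. A minimal usage sketch (the input texts are placeholders; assumes
network access or an already-populated local model cache):

    ParserChunker2MatcherProcessor proc = ParserChunker2MatcherProcessor.getInstance();
    SentencePairMatchResult match = proc.assessRelevance(
        "Its classy design makes it a cool vehicle.",  // placeholder text
        "The engine makes it a powerful car.");        // placeholder text
    System.out.println(match.getMatchResult());
    proc.close(); // resets the singleton and serializes the parsing cache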
diff --git a/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserPure2MatcherProcessor.java b/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserPure2MatcherProcessor.java
index 2e21705..c5e5dca 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserPure2MatcherProcessor.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/textsimilarity/chunker2matcher/ParserPure2MatcherProcessor.java
@@ -33,9 +33,13 @@
package opennlp.tools.textsimilarity.chunker2matcher;
+import java.io.IOException;
+import java.lang.invoke.MethodHandles;
import java.util.ArrayList;
import java.util.List;
-import java.util.logging.Logger;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
import opennlp.tools.textsimilarity.LemmaPair;
import opennlp.tools.textsimilarity.ParseTreeChunk;
@@ -44,9 +48,10 @@ import opennlp.tools.textsimilarity.SentencePairMatchResult;
import opennlp.tools.textsimilarity.TextProcessor;
public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor {
+
+ private static final Logger LOG = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
protected static ParserPure2MatcherProcessor pinstance;
- private static final Logger LOG = Logger
-     .getLogger("opennlp.tools.textsimilarity.chunker2matcher.ParserPure2MatcherProcessor");
public synchronized static ParserPure2MatcherProcessor getInstance() {
if (pinstance == null)
@@ -56,10 +61,14 @@ public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor
}
private ParserPure2MatcherProcessor() {
- initializeSentenceDetector();
- initializeTokenizer();
- initializePosTagger();
- initializeParser();
+ try {
+ initializeSentenceDetector();
+ initializeTokenizer();
+ initializePosTagger();
+ initializeParser();
+ } catch (IOException e) {
+ LOG.warn("A model can't be loaded: {}", e.getMessage());
+ }
}
public synchronized List<List<ParseTreeChunk>>
formGroupedPhrasesFromChunksForSentence(
@@ -70,7 +79,7 @@ public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor
sentence = TextProcessor.removePunctuation(sentence);
SentenceNode node = parseSentenceNode(sentence);
if (node == null) {
- LOG.info("Problem parsing sentence '" + sentence);
+ LOG.info("Problem parsing sentence '{}'", sentence);
return null;
}
List<ParseTreeChunk> ptcList = node.getParseTreeChunkList();
@@ -78,7 +87,8 @@ public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor
List<String> TokList = node.getOrderedLemmaList();
List<List<ParseTreeChunk>> listOfChunks = new ArrayList<>();
- List<ParseTreeChunk> nounPhr = new ArrayList<>(), prepPhr = new ArrayList<>(), verbPhr = new ArrayList<>(), adjPhr = new ArrayList<>(),
+ List<ParseTreeChunk> nounPhr = new ArrayList<>(), prepPhr = new ArrayList<>(),
+     verbPhr = new ArrayList<>(), adjPhr = new ArrayList<>(),
// to store the whole sentence
wholeSentence = new ArrayList<>();
@@ -112,11 +122,7 @@ public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor
List<List<ParseTreeChunk>> sent1GrpLst =
formGroupedPhrasesFromChunksForPara(para1), sent2GrpLst =
formGroupedPhrasesFromChunksForPara(para2);
- List<LemmaPair> origChunks1 = listListParseTreeChunk2ListLemmaPairs(sent1GrpLst); // TODO
-                                                                                   // need
-                                                                                   // to
-                                                                                   // populate
-                                                                                   // it!
+ List<LemmaPair> origChunks1 = listListParseTreeChunk2ListLemmaPairs(sent1GrpLst);
ParseTreeMatcherDeterministic md = new ParseTreeMatcherDeterministic();
List<List<ParseTreeChunk>> res = md
@@ -126,16 +132,13 @@ public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor
}
public static void main(String[] args) throws Exception {
- ParserPure2MatcherProcessor parser = ParserPure2MatcherProcessor
- .getInstance();
+ ParserPure2MatcherProcessor parser = ParserPure2MatcherProcessor.getInstance();
String text = "Its classy design and the Mercedes name make it a very cool
vehicle to drive. ";
List<List<ParseTreeChunk>> res = parser
.formGroupedPhrasesFromChunksForPara(text);
System.out.println(res);
- // System.exit(0);
-
String phrase1 = "Its classy design and the Mercedes name make it a very
cool vehicle to drive. "
+ "The engine makes it a powerful car. "
+ "The strong engine gives it enough power. "
@@ -145,18 +148,15 @@ public class ParserPure2MatcherProcessor extends ParserChunker2MatcherProcessor
+ "This car provides you a very good mileage.";
String sentence = "Not to worry with the 2cv.";
- System.out.println(parser.assessRelevance(phrase1, phrase2)
- .getMatchResult());
-
- System.out
- .println(parser
- .formGroupedPhrasesFromChunksForSentence("Its classy design and
the Mercedes name make it a very cool vehicle to drive. "));
- System.out
- .println(parser
- .formGroupedPhrasesFromChunksForSentence("Sounds too good to be
true but it actually is, the world's first flying car is finally here. "));
- System.out
- .println(parser
- .formGroupedPhrasesFromChunksForSentence("UN Ambassador Ron Prosor
repeated the Israeli position that the only way the Palestinians will get UN
membership and statehood is through direct negotiations with the Israelis on a
comprehensive peace agreement"));
+ System.out.println(parser.assessRelevance(phrase1, phrase2).getMatchResult());
+
+ System.out.println(parser.formGroupedPhrasesFromChunksForSentence(
+ "Its classy design and the Mercedes name make it a very cool
vehicle to drive. "));
+ System.out.println(parser.formGroupedPhrasesFromChunksForSentence(
+ "Sounds too good to be true but it actually is, the world's first
flying car is finally here. "));
+ System.out.println(parser.formGroupedPhrasesFromChunksForSentence(
+ "UN Ambassador Ron Prosor repeated the Israeli position that the
only way the Palestinians will get " +
+ "UN membership and statehood is through direct negotiations with
the Israelis on a comprehensive peace agreement"));
}
}
diff --git a/opennlp-similarity/src/test/resources/models/en-sent.bin b/opennlp-similarity/src/test/resources/models/en-sent.bin
deleted file mode 100644
index e89076b..0000000
Binary files a/opennlp-similarity/src/test/resources/models/en-sent.bin and /dev/null differ
diff --git a/pom.xml b/pom.xml
index c2f4a52..e98b18d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -158,22 +158,38 @@
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
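+ <!-- bridges legacy Log4j 1.x API calls from transitive dependencies onto SLF4J -->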
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>log4j-over-slf4j</artifactId>
+ <version>${slf4j.version}</version>
+ <scope>runtime</scope>
+ </dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.6</version>
</dependency>
+ <dependency>
+ <groupId>commons-io</groupId>
+ <artifactId>commons-io</artifactId>
+ <version>2.18.0</version>
+ </dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
- <version>3.12.0</version>
+ <version>3.17.0</version>
</dependency>
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.15</version>
</dependency>
+ <dependency>
+ <groupId>org.apache.commons</groupId>
+ <artifactId>commons-math3</artifactId>
+ <version>3.6.1</version>
+ </dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>