1.8.1 release

2017-07-01 Thread Joern Kottmann
Dear all,

We will be making a 1.8.1 release of OpenNLP in the next few days. All
issues in JIRA are now closed.

Jörn


[VOTE] Apache OpenNLP 1.8.1 Release Candidate

2017-07-01 Thread Suneel Marthi
The Apache OpenNLP PMC would like to call for a Vote on Apache OpenNLP 1.8.1
Release Candidate.

The Release artifacts can be downloaded from:

https://repository.apache.org/content/repositories/orgapacheopennlp-1014/org/apache/opennlp/opennlp-distr/1.8.1/

The release was made from the Apache OpenNLP 1.8.1 tag at

https://github.com/apache/opennlp/tree/opennlp-1.8.1

To use it in a Maven build, set the version for opennlp-tools or opennlp-uima
to 1.8.1 and add the following staging repository URL to your settings.xml
file:

https://repository.apache.org/content/repositories/orgapacheopennlp-1014
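
For reference, a minimal sketch of the corresponding settings.xml profile and
pom.xml dependency (the profile id is illustrative; only the repository URL
above comes from this message):

  <!-- settings.xml: enable the staging repository -->
  <profiles>
    <profile>
      <id>opennlp-staging</id>
      <repositories>
        <repository>
          <id>orgapacheopennlp-1014</id>
          <url>https://repository.apache.org/content/repositories/orgapacheopennlp-1014</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>opennlp-staging</activeProfile>
  </activeProfiles>

  <!-- pom.xml: pull in the release candidate -->
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.1</version>
  </dependency>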

The artifacts have been signed with the key D3541808, found at

http://people.apache.org/keys/group/opennlp.asc

Please vote on releasing these packages as Apache OpenNLP 1.8.1. The vote is
open for the next 72 hours, ending on Monday, July 3, 11 AM EST.

Only votes from OpenNLP PMC members are binding, but folks are welcome to
check the release candidate and voice their approval or disapproval. The vote
passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache OpenNLP 1.8.1

[ ] -1 Do not release the packages because...

Thanks again to all the committers and contributors for their work over the
past few weeks.


Re: [VOTE] Apache OpenNLP 1.8.1 Release Candidate

2017-07-01 Thread Suneel Marthi
Here's my +1 binding

1. Verified the signatures and checksums
2. Ran a clean build of {src} * {zip, tar} and all unit tests pass
3. Verified RAT check
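
For reference, a minimal sketch of that verification, assuming the usual
Apache layout of detached .asc signatures and .sha1 checksums next to each
archive (exact artifact names may differ):

  # import the OpenNLP release keys
  wget http://people.apache.org/keys/group/opennlp.asc
  gpg --import opennlp.asc

  # check the signature of an archive, e.g. the source zip
  gpg --verify apache-opennlp-1.8.1-src.zip.asc apache-opennlp-1.8.1-src.zip

  # recompute the checksum and compare against the published one
  sha1sum apache-opennlp-1.8.1-src.zip
  cat apache-opennlp-1.8.1-src.zip.sha1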



Spelling correction

2017-07-01 Thread Damiano Porta
Hello everybody,
I am dealing with data normalization on very noisy sentences with many
spelling errors.

Do you know of a good paper that explains how to build a model to fix
this kind of problem? I am happy to share the code if you are interested
in integrating it into OpenNLP.

Thanks
Damiano


Re: Spelling correction

2017-07-01 Thread Suneel Marthi
'Spelling Correction' has been the most popular ask from audiences at my
recent NLP talks; it would be great to have this feature in OpenNLP.

I am not aware of any papers on this, but the first thing that comes to
mind and is relevant is the 'noisy channel' model.
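
To make the idea concrete, here is a small self-contained sketch of a
noisy-channel corrector in Java. It is not OpenNLP API: the unigram counts
stand in for the language model P(c), candidates are generated at edit
distance 1, and the error model P(w|c) is collapsed to a constant penalty,
all simplifying assumptions.

  import java.util.*;

  public class NoisyChannelSketch {

      // Unigram counts standing in for the language model P(c).
      private final Map<String, Integer> counts = new HashMap<>();
      private long total = 0;

      public NoisyChannelSketch(List<String> corpusTokens) {
          for (String t : corpusTokens) {
              counts.merge(t.toLowerCase(), 1, Integer::sum);
              total++;
          }
      }

      // Candidate generation: every string at edit distance 1 from w
      // (deletions, insertions, substitutions; transpositions omitted).
      private Set<String> edits1(String w) {
          Set<String> out = new HashSet<>();
          for (int i = 0; i <= w.length(); i++) {
              if (i < w.length()) {
                  out.add(w.substring(0, i) + w.substring(i + 1));
              }
              for (char c = 'a'; c <= 'z'; c++) {
                  out.add(w.substring(0, i) + c + w.substring(i));
                  if (i < w.length()) {
                      out.add(w.substring(0, i) + c + w.substring(i + 1));
                  }
              }
          }
          return out;
      }

      // Decode: argmax over candidates c of P(c) * P(w|c), with P(w|c)
      // reduced to a constant edit penalty for this sketch.
      public String correct(String typed) {
          String w = typed.toLowerCase();
          if (counts.containsKey(w)) {
              return w; // known word, assume it is already correct
          }
          String best = w;
          double bestScore = 0.0;
          for (String cand : edits1(w)) {
              Integer n = counts.get(cand);
              if (n != null) {
                  double score = (n / (double) total) * 0.01;
                  if (score > bestScore) {
                      bestScore = score;
                      best = cand;
                  }
              }
          }
          return best;
      }

      public static void main(String[] args) {
          NoisyChannelSketch sc = new NoisyChannelSketch(
                  Arrays.asList("the", "noisy", "channel", "model", "spelling"));
          System.out.println(sc.correct("chanel")); // prints "channel"
      }
  }

A real error model would learn edit probabilities from spelling-error data
rather than using a constant.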





Re: Spelling correction

2017-07-01 Thread Damiano Porta
I have also read about the noisy channel. I could work on this if you think
it is a good approach.

Damiano



Re: Spelling correction

2017-07-01 Thread Suneel Marthi
+1



Re: [VOTE] Apache OpenNLP 1.8.1 Release Candidate

2017-07-01 Thread Richard Eckart de Castilho
Hi all,

I ran a DKPro Core build against the RC. Looks mostly fine. No code changes
are required after switching from 1.8.0 to 1.8.1. All unit tests except one
run as before.

I can observe a change when training a sentence splitter model.

With 1.8.0, I get

  F-score 0.937518
  Precision   0.932157
  Recall  0.942941

With 1.8.1, I get

  F-score 0.922556
  Precision   0.909975
  Recall  0.935490

I am using the GermEval 2014 data for training.

It is not a big drop, but it still is a change - maybe an undesired one?

Best,

-- Richard



Re: Spelling correction

2017-07-01 Thread Suneel Marthi
You could also leverage language models for spell correction. OpenNLP has a
stupid-backoff implementation: create a language model with that algorithm
and use it for spell checks.
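
For context, stupid backoff scores an n-gram by its relative frequency and
backs off to the shorter context with a fixed penalty when the full n-gram is
unseen. A minimal self-contained sketch of the scoring rule (the 0.4 factor
follows Brants et al. 2007; the counts map and method names here are
illustrative, not OpenNLP's API):

  import java.util.*;

  public class StupidBackoffSketch {

      private static final double BACKOFF = 0.4;

      // n-gram counts keyed by space-joined tokens; illustrative storage.
      private final Map<String, Integer> counts = new HashMap<>();
      private long totalUnigrams = 0;

      // Count every n-gram (n = 1..tokens.length) of the sentence.
      public void add(String... tokens) {
          for (int n = 1; n <= tokens.length; n++) {
              for (int i = 0; i + n <= tokens.length; i++) {
                  String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                  counts.merge(key, 1, Integer::sum);
                  if (n == 1) {
                      totalUnigrams++;
                  }
              }
          }
      }

      // S(word | context) = count(context word) / count(context) if seen,
      // otherwise BACKOFF * S(word | shorter context); unigram base case.
      public double score(String[] context, String word) {
          if (context.length == 0) {
              return counts.getOrDefault(word, 0) / (double) totalUnigrams;
          }
          String full = String.join(" ", context) + " " + word;
          Integer c = counts.get(full);
          if (c != null) {
              return c / (double) counts.get(String.join(" ", context));
          }
          return BACKOFF * score(Arrays.copyOfRange(context, 1, context.length), word);
      }
  }

For spell checking, each candidate correction would be scored in its sentence
context and the highest-scoring candidate preferred.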



Re: Spelling correction

2017-07-01 Thread Daniel Russ
Damiano,

There is a lot of research on spelling correction. Here is a paper from a
group out of the National Library of Medicine:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2137159/
They also have a product called GSpell, which uses the NLM lexicon:
https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/gSpell/current/GSpell.html
It might not work for OpenNLP (too English-based), but it is something to
look into. I dabble in the spelling-correction field, but have not worked
seriously in it. I'd be willing to help on this project, but I don't have a
lot of time.

Daniel





[GitHub] opennlp-sandbox pull request #3: text sequence classification using Glove an...

2017-07-01 Thread thammegowda
GitHub user thammegowda opened a pull request:

https://github.com/apache/opennlp-sandbox/pull/3

text sequence classification using GloVe and RNN/LSTMs

Summary:
+ Added a dataset reader for feeding mini-batches to DL4J's network
+ GloVe embeddings to vectorize text, using the Stanford NLP group's
pre-trained GloVe vectors
+ A tiny (2-layer) classifier based on LSTMs (sketched below)
+ All of the above are written with reuse in mind for other multi-class
text classifiers. We can customize these easily:
  + number of classes
  + vector embeddings
  + vocabulary size
  + number of LSTM cells, batch size, etc.
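
For orientation, the two-layer network might be configured roughly as in the
following sketch, written against the DL4J 0.x-era configuration API (which
has changed across versions); the dimensions are illustrative, mirroring the
defaults elsewhere in this PR: 100-d GloVe vectors, 128 RNN units, two labels.

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.GravesLSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class TinyLstmSketch {
    public static MultiLayerNetwork build() {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .list()
                // layer 0: LSTM over the 100-d GloVe word vectors
                .layer(0, new GravesLSTM.Builder()
                        .nIn(100).nOut(128)
                        .activation(Activation.TANH).build())
                // layer 1: softmax output over the labels (e.g. pos, neg)
                .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX)
                        .nIn(128).nOut(2).build())
                .build();
        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}
```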

### Known issues:
When the learning rate is too low or too high, the gradients quickly bounce
to Infinity or NaN. The `-lr` parameter should be tuned based on the dataset
and vectors.

### Datasets 
```
# Download pre-trained GloVe vectors (this is a large file)
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d glove.6B

# Download the IMDB sentiment dataset
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar xzf aclImdb_v1.tar.gz
```

Note: try it out on smaller datasets first. Suggestion: create
`aclImdb/train-lite` and `aclImdb/test-lite` with 1500 positive and 1500
negative examples each from the IMDB dataset.
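
For example, a rough sketch of building those lite subsets (the layout
follows the download step above; which files you keep is arbitrary):

```bash
# build aclImdb/{train,test}-lite with 1500 examples per label
for split in train test; do
  for label in pos neg; do
    mkdir -p "aclImdb/$split-lite/$label"
    for f in $(ls "aclImdb/$split/$label" | head -n 1500); do
      cp "aclImdb/$split/$label/$f" "aclImdb/$split-lite/$label/"
    done
  done
done
```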

### Data Set Organization:
In general, organize the directory as follows:
```
data-dir/
+ train/
|  +- label1 /
|  |+- example11.txt
|  |+- example12.txt
|  |+- example13.txt
|  |+- .
|  +- label2 /
|  |+- example21.txt
|  |+- .
|  +- labelN /
|   +- exampleN1.txt
|   +- .
+ test/
 + label1/
  +- 
```
Note: the large IMDB dataset already ships in this format; just reduce the
file count for quicker testing.

```
alias RUN="mvn compile exec:java -Dexec.mainClass=opennlp.tools.dl.GloveRNNTextClassifier"
```

## Train 
```bash
RUN -Dexec.args="-glovesPath glove.6B/glove.6B.100d.txt  \
-labels pos neg -modelPath imdb-sentimodel.dat \
-trainDir=aclImdb/train-lite -lr 0.001"
```


## Predict 
```bash
RUN -Dexec.args="-glovesPath glove.6B/glove.6B.100d.txt \
  -labels pos neg -modelPath imdb-sentimodel.dat \
  -files aclImdb/test/pos/1_10.txt aclImdb/test/neg/1_3.txt"
```

---
## CLI Arguments and Default values to GloveRNNTextClassifier:
```
 -batchSize N       : Number of examples in minibatch. Applicable for
                      training only. (default: 128)
 -files STRING[]    : File paths (separated by space) to predict using the
                      model.
 -glovesPath VAL    : Path to GloVe vectors file. Download and unzip from
                      https://nlp.stanford.edu/projects/glove/
 -labels STRING[]   : Names of targets or labels separated by spaces. The
                      order of labels matters. Make sure to use the same
                      sequence for training and predicting. Also, these names
                      should match subdirectory names of -trainDir and
                      -validDir when those are applicable.
                      Example: -labels pos neg
 -lr (-learnRate) N : Learning rate. Adjust it when the scores bounce to NaN
                      or Infinity. (default: 0.002)
 -maxSeqLen N       : Max sequence length. Sequences longer than this will be
                      truncated. (default: 256)
 -modelPath VAL     : Path to model file. This will be used for serializing
                      the model after the training phase, and also the model
                      will be restored from here for prediction.
 -nEpochs N         : Number of epochs (i.e. full passes over the training
                      data) to train on. Applicable for training only.
                      (default: 2)
 -nRNNUnits N       : Number of RNN cells to use. Applicable for training
                      only. (default: 128)
 -trainDir VAL      : Path to train data directory. Optional. Setting this
                      value will take the system to training mode.
 -validDir VAL      : Path to validation data directory. Optional. Applicable
                      only when -trainDir is set.
 -vocabSize N       : Vocabulary size. (default: 2)

```

## References:
+ GloVe - https://nlp.stanford.edu/projects/glove/
+ RNN and LSTM support in DL4J - https://deeplearning4j.org/usingrnns


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/opennlp-sandbox glove-rnn-classifier

Alternatively you can review and apply these changes as the