GitHub user thammegowda opened a pull request:
https://github.com/apache/opennlp-sandbox/pull/3
text sequence classification using Glove and RNN/LSTMs
Summary:
+ Added a dataset reader for feeding mini batches to DL4J's network
+ GloVe embeddings to vectorize text using the Stanford NLP group's pre-trained GloVe vectors
+ A tiny (2-layer) classifier based on LSTMs
+ All of the above are written to be reusable for other multi-class text classifiers. These can be customized easily:
  + number of classes
  + vector embeddings
  + vocabulary size
  + number of LSTM cells, batch size, etc.
### Known issues:
When the learning rate is too low or too high, the gradients quickly bounce to
Infinity or NaN.
Tune the `-lr` parameter to suit the dataset and vectors in use.
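One way to tune `-lr` is to try a handful of candidate rates and watch which runs keep finite scores. The sketch below only prints the training commands (a dry run) so they can be reviewed before anything is launched. The helper name `lr_sweep_commands`, the candidate rates, and the per-rate model file names are illustrative assumptions, not part of this pull request; the flags are the ones documented in the CLI arguments section.

```bash
# Print (do not run) one training command per candidate learning rate.
# Usage: lr_sweep_commands TRAIN_DIR LR...
# The per-rate model file name is a hypothetical naming scheme.
lr_sweep_commands() {
  local train_dir="$1"; shift
  local lr
  for lr in "$@"; do
    echo "mvn compile exec:java -Dexec.mainClass=opennlp.tools.dl.GloveRNNTextClassifier" \
         "-Dexec.args=\"-glovesPath glove.6B/glove.6B.100d.txt -labels pos neg" \
         "-modelPath imdb-sentimodel-lr${lr}.dat -trainDir ${train_dir} -lr ${lr}\""
  done
}
```

For example, `lr_sweep_commands aclImdb/train-lite 0.0005 0.001 0.002` prints three commands, one per rate, each writing to its own model file so the runs do not overwrite each other.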
### Datasets
```
# Download pre-trained GloVe vectors (this is a large file)
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d glove.6B
# Download dataset
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar xzf aclImdb_v1.tar.gz
```
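After unzipping, it can help to confirm that the vectors file has the expected shape before training. A GloVe text file holds one word per line followed by its space-separated vector components, so the dimension is the field count minus one. The helper names `glove_dim` and `glove_bad_lines` below are hypothetical and are not part of this pull request.

```bash
# Print the vector dimension of a GloVe text file: the number of
# whitespace-separated fields on the first line, minus one for the word.
glove_dim() {
  head -n 1 "$1" | awk '{ print NF - 1 }'
}

# Print how many lines do NOT carry exactly WANT vector components.
# Usage: glove_bad_lines FILE WANT
glove_bad_lines() {
  awk -v want="$2" 'NF - 1 != want { bad++ } END { print bad + 0 }' "$1"
}
```

Going by its file name, `glove_dim glove.6B/glove.6B.100d.txt` should report 100, and `glove_bad_lines glove.6B/glove.6B.100d.txt 100` should report 0 if the file was extracted intact.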
Note: try it out on smaller datasets first. Suggestion: create
`aclImdb/train-lite` and `aclImdb/test-lite`, each with 1500 positive and
1500 negative examples from the IMDB dataset.
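The lite directories suggested above could be created with a small helper like the following. This is a minimal sketch: `make_lite` is a hypothetical name, and it simply takes the first N files per label in name order rather than a random sample, which is assumed to be good enough for a quick smoke test.

```bash
# Copy the first N .txt files of each label directory from SRC into DEST.
# Usage: make_lite SRC DEST N LABEL...
make_lite() {
  local src="$1" dest="$2" n="$3"; shift 3
  local label
  for label in "$@"; do
    mkdir -p "$dest/$label"
    # List files in name order, keep the first N, and copy them over.
    find "$src/$label" -maxdepth 1 -type f -name '*.txt' | sort | head -n "$n" |
      while IFS= read -r f; do cp "$f" "$dest/$label/"; done
  done
}
```

For the IMDB data this would be `make_lite aclImdb/train aclImdb/train-lite 1500 pos neg`, followed by `make_lite aclImdb/test aclImdb/test-lite 1500 pos neg`.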
### Data Set Organization:
In general, organize the directory as follows:
```
data-dir/
    + train/
    |    +- label1/
    |    |    +- example11.txt
    |    |    +- example12.txt
    |    |    +- example13.txt
    |    |    +- .....
    |    +- label2/
    |    |    +- example21.txt
    |    |    +- .....
    |    +- labelN/
    |         +- exampleN1.txt
    |         +- .....
    + test/
         +- label1/
              +- ........
```
Note: the large IMDB dataset already ships in this format; just reduce the
file count for quicker testing.
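To check that a directory follows the layout described above, counting the example files under each label directory is a quick sanity test. The helper name `count_examples` is an assumption made for illustration and is not part of this pull request.

```bash
# Print "<label> <count>" for every label directory under a split directory.
# Usage: count_examples SPLIT_DIR
count_examples() {
  local split_dir="$1" d
  for d in "$split_dir"/*/; do
    # Skip the unexpanded glob pattern when there are no subdirectories.
    [ -d "$d" ] || continue
    echo "$(basename "$d") $(find "$d" -maxdepth 1 -type f -name '*.txt' | wc -l | tr -d ' ')"
  done
}
```

Running `count_examples aclImdb/train-lite` should list exactly the names that will later be passed to `-labels`, each with a non-zero count.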
```
alias RUN="mvn compile exec:java \
  -Dexec.mainClass=opennlp.tools.dl.GloveRNNTextClassifier"
```
## Train
```bash
RUN -Dexec.args="-glovesPath glove.6B/glove.6B.100d.txt \
-labels pos neg -modelPath imdb-sentimodel.dat \
-trainDir=aclImdb/train-lite -lr 0.001"
```
## Predict
```bash
RUN -Dexec.args="-glovesPath glove.6B/glove.6B.100d.txt \
-labels pos neg -modelPath imdb-sentimodel.dat \
-files aclImdb/test/pos/1_10.txt aclImdb/test/neg/1_3.txt"
```
---
## CLI arguments and default values for GloveRNNTextClassifier
```
 -batchSize N        : Number of examples in minibatch. Applicable for training
                       only. (default: 128)
 -files STRING[]     : File paths (separated by space) to predict using the
                       model.
 -glovesPath VAL     : Path to GloVe vectors file. Download and unzip from
                       https://nlp.stanford.edu/projects/glove/
 -labels STRING[]    : Names of targets or labels separated by spaces. The order
                       of labels matters. Make sure to use the same sequence for
                       training and predicting. Also, these names should match
                       subdirectory names of -trainDir and -validDir when those
                       are applicable.
                       Example -labels pos neg
 -lr (-learnRate) N  : Learning Rate. Adjust it when the scores bounce to NaN or
                       Infinity. (default: 0.002)
 -maxSeqLen N        : Max Sequence Length. Sequences longer than this will be
                       truncated (default: 256)
 -modelPath VAL      : Path to model file. This will be used for serializing the
                       model after the training phase, and also the model will
                       be restored from here for prediction
 -nEpochs N          : Number of epochs (i.e. full passes over the training
                       data) to train on. Applicable for training only.
                       (default: 2)
 -nRNNUnits N        : Number of RNN cells to use. Applicable for training only.
                       (default: 128)
 -trainDir VAL       : Path to train data directory. Optional. Setting this
                       value will take the system to training mode.
 -validDir VAL       : Path to validation data directory. Optional. Applicable
                       only when -trainDir is set.
 -vocabSize N        : Vocabulary Size. (default: 20000)
```
## References:
+ GloVe - https://nlp.stanford.edu/projects/glove/
+ RNN and LSTM support in DL4J - https://deeplearning4j.org/usingrnns
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/thammegowda/opennlp-sandbox glove-rnn-classifier
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/opennlp-sandbox/pull/3.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3
----
commit a80f29bc3db7d644c412ec12e578181855c9d0cf
Author: Thamme Gowda
Date: 2017-07-01T23:38:31Z
text sequence classification using Glove and RNN/LSTMs
----