[GitHub] opennlp-sandbox pull request #3: text sequence classification using Glove an...

thammegowda Sat, 01 Jul 2017 16:50:35 -0700

GitHub user thammegowda opened a pull request:

    https://github.com/apache/opennlp-sandbox/pull/3


    text sequence classification using Glove and RNN/LSTMs

    Summary: 
    + Added a dataset reader for feeding mini batches to DL4J's network
    + GloVe embeddings to vectorize text using the Stanford NLP group's pre-trained GloVe vectors
    + A tiny (2 layer) classifier based on LSTMs
    + All of the above are written so that they can be reused for other multi-class text classifiers. These are easy to customize:
      + number of classes 
      + vector embeddings
      + vocabulary size
      + number of LSTM cells, batch size, etc.
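
To make the vectorization step concrete, here is a minimal sketch of how a GloVe text file can be turned into a lookup table and how a review can be mapped to a sequence of vectors. It is not the code in this PR: the class name `GloveSketch`, whitespace tokenization, lowercasing (glove.6B is uncased), and skipping out-of-vocabulary words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the PR.
class GloveSketch {

    /** Parses GloVe text lines of the form "word v1 v2 ... vd". */
    static Map<String, float[]> parse(List<String> lines) {
        Map<String, float[]> vectors = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(" ");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            float[] vec = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vec[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vec);
        }
        return vectors;
    }

    /**
     * Maps text to at most maxSeqLen vectors.
     * Assumption: unknown words are skipped rather than replaced.
     */
    static float[][] vectorize(String text, Map<String, float[]> vectors, int maxSeqLen) {
        List<float[]> seq = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (seq.size() >= maxSeqLen) {
                break; // truncate long sequences
            }
            float[] vec = vectors.get(token);
            if (vec != null) {
                seq.add(vec);
            }
        }
        return seq.toArray(new float[0][]);
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for glove.6B.100d.txt (2 dimensions only).
        Map<String, float[]> vectors = parse(Arrays.asList("good 0.1 0.2", "bad -0.3 0.4"));
        float[][] seq = vectorize("A good movie, not bad", vectors, 256);
        System.out.println("sequence length = " + seq.length);
    }
}
```

For the real file, the lines could be read with `Files.readAllLines` or streamed; the 400K-word glove.6B files are large, which is one reason to cap the vocabulary.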
    
    ### Known issues:
    When the learning rate is too low or too high, the gradients quickly blow up to Infinity or NaN.
    Tune the `-lr` parameter for the dataset and vectors in use.
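
As a toy illustration of the too-high case only, the sketch below runs plain gradient descent on f(x) = x^2. It has nothing to do with the LSTM in this PR; the class name `LearningRateDemo` and the chosen rates are hypothetical. With a step size above 1.0 each update overshoots and doubles the magnitude, so the value overflows to Infinity and then becomes NaN.

```java
// Hypothetical toy example, not part of the PR.
class LearningRateDemo {

    /** Runs plain gradient descent on f(x) = x^2 from x = 1 and returns the final x. */
    static double descend(double learningRate, int steps) {
        double x = 1.0;
        for (int i = 0; i < steps; i++) {
            double gradient = 2.0 * x;     // derivative of x^2
            x -= learningRate * gradient;  // standard gradient step
        }
        return x;
    }

    public static void main(String[] args) {
        for (double lr : new double[] {0.001, 0.1, 1.5}) {
            double x = descend(lr, 2000);
            System.out.println("lr=" + lr + " final x=" + x
                    + " finite=" + Double.isFinite(x));
        }
    }
}
```

A similar `Double.isFinite` check on the training score is a cheap way to stop a run early and retry with a different `-lr`.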
    
    ### Datasets 
    ```
    # Download pre-trained GloVe vectors (this is a large file)
    wget http://nlp.stanford.edu/data/glove.6B.zip
    unzip glove.6B.zip -d glove.6B
    
    # Download dataset
    wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    tar xzf aclImdb_v1.tar.gz
    ```
    
    Note: try it out on smaller datasets first. Suggestion: create `aclImdb/train-lite` and `aclImdb/test-lite` with 1500 positive and 1500 negative examples each from the IMDB dataset.
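
One way to build such a lite split is sketched below. This helper is not part of the PR: the class name `LiteSubset` and its arguments are hypothetical, and it simply copies the first N files per label in filename order, which is an assumption (a random sample may be preferable).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the PR.
class LiteSubset {

    /** Copies the first n regular files (by name) from srcLabelDir to dstLabelDir. */
    static int copyFirstN(Path srcLabelDir, Path dstLabelDir, int n) throws IOException {
        Files.createDirectories(dstLabelDir);
        List<Path> picked;
        try (Stream<Path> files = Files.list(srcLabelDir)) {
            picked = files.filter(Files::isRegularFile)
                          .sorted()
                          .limit(n)
                          .collect(Collectors.toList());
        }
        for (Path src : picked) {
            Files.copy(src, dstLabelDir.resolve(src.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return picked.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: LiteSubset <srcSplitDir> <dstSplitDir>");
            return;
        }
        // Label names follow the IMDB layout used in this PR description.
        for (String label : new String[] {"pos", "neg"}) {
            int copied = copyFirstN(Paths.get(args[0], label), Paths.get(args[1], label), 1500);
            System.out.println(label + ": copied " + copied + " files");
        }
    }
}
```

For example, it could be run once with `aclImdb/train aclImdb/train-lite` and once with `aclImdb/test aclImdb/test-lite` as arguments.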
    
    ### Data Set Organization:
    In general, organize the directory as follows:
    ```
    data-dir/
        + train/
        |  +- label1 /
        |  |    +- example11.txt
        |  |    +- example12.txt
        |  |    +- example13.txt
        |  |    +- .....
        |  +- label2 /
        |  |    +- example21.txt
        |  |    +- .....
        |  +- labelN /
        |       +- exampleN1.txt
        |       +- .....
        + test/
             + label1/
                  +- ........
    ```
    Note: the IMDB large dataset already ships in this format; just reduce the file count for quicker testing.
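
The layout above can be read with a few lines of standard Java. The sketch below is not the dataset reader from this PR; the class name `LabeledFiles` and its `Example` type are hypothetical. It assumes each label's index is its position in the list passed in, mirroring how the order of `-labels` matters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of the PR.
class LabeledFiles {

    /** One example: a text file plus the index of its label. */
    static final class Example {
        final Path file;
        final int labelIndex;

        Example(Path file, int labelIndex) {
            this.file = file;
            this.labelIndex = labelIndex;
        }
    }

    /** Lists files under splitDir/<label>/ for each label, in label order. */
    static List<Example> list(Path splitDir, List<String> labels) throws IOException {
        List<Example> examples = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            final int labelIndex = i;
            Path labelDir = splitDir.resolve(labels.get(i));
            try (Stream<Path> files = Files.list(labelDir)) {
                files.filter(Files::isRegularFile)
                     .sorted()
                     .forEach(f -> examples.add(new Example(f, labelIndex)));
            }
        }
        return examples;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: LabeledFiles <splitDir> <label1> [label2 ...]");
            return;
        }
        List<String> labels = Arrays.asList(args).subList(1, args.length);
        List<Example> examples = list(Paths.get(args[0]), labels);
        System.out.println("found " + examples.size() + " examples");
    }
}
```

A mini-batch reader would typically shuffle this list and then read the files in chunks of `-batchSize`.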
    
    ```
    alias RUN="mvn compile exec:java -Dexec.mainClass=opennlp.tools.dl.GloveRNNTextClassifier"
    ```
    
    ## Train 
    ```bash
    RUN -Dexec.args="-glovesPath glove.6B/glove.6B.100d.txt  \
        -labels pos neg -modelPath imdb-sentimodel.dat \
        -trainDir=aclImdb/train-lite -lr 0.001"
    ```
    
    
    ## Predict 
    ```bash
    RUN -Dexec.args="-glovesPath glove.6B/glove.6B.100d.txt \
        -labels pos neg -modelPath imdb-sentimodel.dat \
        -files aclImdb/test/pos/1_10.txt aclImdb/test/neg/1_3.txt"
    ```
    
    ---
    ## CLI arguments and default values for GloveRNNTextClassifier:
    ```
     -batchSize N       : Number of examples in minibatch. Applicable for training
                          only. (default: 128)
     -files STRING[]    : File paths (separated by space) to predict using the
                          model.
     -glovesPath VAL    : Path to GloVe vectors file. Download and unzip from
                          https://nlp.stanford.edu/projects/glove/
     -labels STRING[]   : Names of targets or labels separated by spaces. The order
                          of labels matters. Make sure to use the same sequence for
                          training and predicting. Also, these names should match
                          subdirectory names of -trainDir and -validDir when those
                          are applicable.
                           Example -labels pos neg
     -lr (-learnRate) N : Learning Rate. Adjust it when the scores bounce to NaN or
                          Infinity. (default: 0.002)
     -maxSeqLen N       : Max Sequence Length. Sequences longer than this will be
                          truncated (default: 256)
     -modelPath VAL     : Path to model file. This will be used for serializing the
                          model after the training phase.and also the model will be
                          restored from here for prediction
     -nEpochs N         : Number of epochs (i.e. full passes over the training
                          data) to train on. Applicable for training only.
                          (default: 2)
     -nRNNUnits N       : Number of RNN cells to use. Applicable for training only.
                          (default: 128)
     -trainDir VAL      : Path to train data directory. Optional. Setting this
                          value will take the system to training mode.
     -validDir VAL      : Path to validation data directory. Optional. Applicable
                          only when -trainDir is set.
     -vocabSize N       : Vocabulary Size. (default: 20000)
    
    ```
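
To show why the order of `-labels` has to match between training and prediction, here is a minimal sketch of decoding a network output. It is not taken from this PR: the class name `LabelDecoder` is hypothetical, and it assumes the network emits one probability per label in the same order the labels were given at training time.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the PR.
class LabelDecoder {

    /** Returns the label whose position matches the highest probability. */
    static String decode(double[] probabilities, List<String> labels) {
        if (probabilities.length == 0 || probabilities.length != labels.size()) {
            throw new IllegalArgumentException("need exactly one probability per label");
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return labels.get(best);
    }

    public static void main(String[] args) {
        double[] output = {0.9, 0.1}; // made-up output of a model trained with "-labels pos neg"
        System.out.println(decode(output, Arrays.asList("pos", "neg"))); // same order as training
        System.out.println(decode(output, Arrays.asList("neg", "pos"))); // swapped order
    }
}
```

The same output vector decodes to `pos` with the training order and to `neg` with the swapped order, so a mismatch silently flips every prediction.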
    
    ## References:
    + GloVe - https://nlp.stanford.edu/projects/glove/
    + RNN and LSTM support in DL4J - https://deeplearning4j.org/usingrnns


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/opennlp-sandbox glove-rnn-classifier

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/opennlp-sandbox/pull/3.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3
    
----
commit a80f29bc3db7d644c412ec12e578181855c9d0cf
Author: Thamme Gowda <[email protected]>
Date:   2017-07-01T23:38:31Z

    text sequence classification using Glove and RNN/LSTMs

----


