Hi,

We have created a Spark program to prove the feasibility of adding the RNN
algorithm to Machine Learner.
This program demonstrates all the steps in Machine Learner:

Uploading a dataset

Selecting the hyper parameters for the model

Creating an RNN model from the data and training it

Calculating the accuracy of the model

Saving the model (as a serialized object)

Predicting using the model
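
The save and predict steps above can be sketched roughly as follows. This is
only an illustration in plain Python, not the actual Spark/deeplearning4j
code: the model state, the predict helper and the file name model.pkl are
all hypothetical, and pickle stands in for whatever serialization the
program really uses.

```python
import os
import pickle
import tempfile

# Hypothetical trained state; a real RNN would hold far more parameters.
model_state = {"vocabulary": ["great", "terrible"], "weights": [1.0, -1.0]}


def predict(state, tokens):
    """Return 1 (positive) or 0 (negative) from a toy word-weight score."""
    score = sum(
        weight
        for word, weight in zip(state["vocabulary"], state["weights"])
        if word in tokens
    )
    return 1 if score > 0 else 0


# Save the model as a serialized object, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(path, "wb") as handle:
        pickle.dump(model_state, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# Predict using the restored model.
prediction = predict(restored, ["a", "great", "film"])
```

The point of the round trip is that the restored object alone is enough to
predict, so training does not have to be repeated.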

This program is based on deeplearning4j and the Apache Spark pipeline.
Deeplearning4j is used as the deep learning library for the recurrent neural
network algorithm. Because the program has to be built on the Spark pipeline,
the main challenge was to use the deeplearning4j library within it. Every
component used as a pipeline stage must be compatible with the Spark
pipeline. Components that are not compatible have to be wrapped in an
org.apache.spark.ml.PredictionModel object.

We have designed a pipeline with a sequence of stages (transformers and
estimators):

1. Tokenizer (Transformer): Splits each item of sequential data into tokens
(for example, in sentiment analysis, splits text into words).

2. Vectorizer (Transformer): Transforms features into vectors.

3. RNN algorithm (Estimator): Trains on a data frame and produces an RNN
model.

4. RNN model (Transformer): Transforms a data frame with features into a
data frame with predictions.
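
The four stages above can be sketched in plain Python to show how
transformers and an estimator fit together. This is a minimal illustration
under stated assumptions, not our Spark code: every class name here is
hypothetical, rows are plain dicts instead of a data frame, and
StandInEstimator is a trivial word-weight classifier used in place of the
RNN.

```python
from collections import Counter


class Tokenizer:
    """Transformer: splits each text into lowercase word tokens."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class Vectorizer:
    """Transformer: turns tokens into a count vector over a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, rows):
        out = []
        for row in rows:
            counts = Counter(row["tokens"])
            out.append(dict(row, features=[counts[w] for w in self.vocabulary]))
        return out


class StandInModel:
    """Transformer produced by fitting: adds a 'prediction' to each row."""

    def __init__(self, weights):
        self.weights = weights

    def transform(self, rows):
        out = []
        for row in rows:
            score = sum(w * x for w, x in zip(self.weights, row["features"]))
            out.append(dict(row, prediction=1 if score > 0 else 0))
        return out


class StandInEstimator:
    """Estimator: fit() returns a model. A stand-in for the RNN, not an RNN."""

    def fit(self, rows):
        weights = [0.0] * len(rows[0]["features"])
        for row in rows:
            sign = 1.0 if row["label"] == 1 else -1.0
            for i, x in enumerate(row["features"]):
                weights[i] += sign * x
        return StandInModel(weights)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs transformers in order and replaces each estimator by its model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # estimator: train it on the data so far
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


# Made-up training rows, for illustration only.
train = [
    {"text": "great movie", "label": 1},
    {"text": "terrible movie", "label": 0},
]
pipeline = Pipeline(
    [Tokenizer(), Vectorizer(["great", "terrible", "movie"]), StandInEstimator()]
)
model = pipeline.fit(train)
predictions = model.transform([{"text": "a great film"}, {"text": "terrible plot"}])
labels = [row["prediction"] for row in predictions]
```

Training calls fit on the whole pipeline, which returns a fitted pipeline
whose last stage is the model. Testing and prediction then only call
transform on that fitted pipeline.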

The diagrams below explain the stages of the pipeline. The first diagram
illustrates the training usage of the pipeline and the second illustrates
its testing and prediction usage.

[Diagram 1: training usage of the pipeline]

[Diagram 2: testing and prediction usage of the pipeline]


I have also tuned the hyper parameters of the RNN model [1] and found the
values that maximize the accuracy of the model.
Given below are the hyper parameters relevant to the RNN algorithm and their
tuned values.


Number of epochs: 10

Number of iterations: 1

Learning rate: 0.02

We used the aclImdb sentiment analysis data set for this program and, with
the above hyper parameters, achieved 60% accuracy. We are now trying to
improve the accuracy and efficiency of our algorithm.

[1]
https://docs.google.com/spreadsheets/d/1Wcta6i2k4Je_5l16wCVlH6zBMNGIb-d7USaWdbrkrSw/edit?ts=56fcdc9b#gid=2118685173


Thanks



On Fri, Mar 25, 2016 at 10:18 AM, Thamali Wijewardhana <[email protected]>
wrote:

> Hi all,
>
> One of the most important obstacles in machine learning and deep learning
> is getting data into a format that neural nets can understand. Neural nets
> understand vectors. Therefore, vectorization is an important part in
> building neural network algorithms.
>
> Canova is a vectorization library for machine learning that is associated
> with the deeplearning4j library. It is designed to support all major types
> of input data, such as text, CSV, image, audio and video.
>
> In our project to add RNN to Machine Learner, we have to use a
> vectorizing component to convert input data to vectors. I think that Canova
> is a better choice for building a generic vectorizing component. I am
> researching the use of Canova for this purpose.
>
> Any suggestions on this are highly appreciated.
>
>
> Thanks
>
>
>
> On Wed, Mar 2, 2016 at 2:25 PM, Thamali Wijewardhana <[email protected]>
> wrote:
>
>> Hi Srinath,
>>
>> We have decided to implement only classification first. Once we complete
>> the classification, we hope to do next-value prediction too.
>> We are basically trying to implement a program to make sure that the
>> deeplearning4j library we are using is compatible with the Apache Spark
>> pipeline. We are also trying to demonstrate all the machine learning
>> steps with that program.
>>
>> We are now using the aclImdb sentiment analysis data set to verify the
>> accuracy of the RNN model we create.
>>
>> Thanks
>> Thamali
>>
>>
>> On Wed, Mar 2, 2016 at 10:38 AM, Srinath Perera <[email protected]> wrote:
>>
>>> Hi Thamali,
>>>
>>>
>>>    1. RNN can do both classification and next-value prediction. Are we
>>>    trying to do both?
>>>    2. When Upul played with it, he had trouble getting the deeplearning4j
>>>    implementation to work with the next-value prediction scenario. Is it
>>>    fixed?
>>>    3. What data sets will we use to verify the accuracy of RNN
>>>    after integration?
>>>
>>>
>>> --Srinath
>>>
>>> On Tue, Mar 1, 2016 at 3:44 PM, Thamali Wijewardhana <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Currently we are working on a project to add the Recurrent Neural
>>>> Network (RNN) algorithm to Machine Learner. RNN is one of the deep
>>>> learning algorithms with record-breaking accuracy. For more information
>>>> on RNN, please refer to link [1].
>>>>
>>>> We have decided to use deeplearning4j, which is an open source deep
>>>> learning library scalable on Spark and Hadoop.
>>>>
>>>> Since there is a plan to add the Spark pipeline to Machine Learner, we
>>>> have decided to apply the Spark pipeline concept in our project.
>>>>
>>>> I have designed an architecture for the RNN implementation.
>>>>
>>>> This architecture is designed to be compatible with the Spark pipeline.
>>>>
>>>> The data set is taken in CSV format and then converted to a Spark data
>>>> frame, since Apache Spark works mostly with data frames.
>>>>
>>>> The next step is a transformer that tokenizes the sequential data. A
>>>> tokenizer takes a sequence of data and breaks it into individual units.
>>>> For example, it can be used to break a sentence into words.
>>>>
>>>> The next step is again a transformer, used to convert tokens to vectors.
>>>> This must be done because the features should be added to the Spark
>>>> pipeline in org.apache.spark.mllib.linalg.VectorUDT format.
>>>>
>>>> Next, the transformed data are fed to the data set iterator. This is an
>>>> object of a class which implements
>>>> org.deeplearning4j.datasets.iterator.DataSetIterator. The data set
>>>> iterator traverses a data set and prepares data for neural networks.
>>>>
>>>> The next component is the RNN algorithm, which is an estimator. The
>>>> data from the data set iterator are fed to the RNN and a model is
>>>> generated. This model can then be used for predictions.
>>>>
>>>> We have decided to complete this project in two steps:
>>>>
>>>>    - First, create a Spark pipeline program containing the steps in
>>>>    Machine Learner (uploading the data set, generating the model,
>>>>    calculating accuracy and prediction) and check whether the project
>>>>    is feasible.
>>>>    - Next, add the algorithm to ML.
>>>>
>>>> Currently we have almost completed the first step and are now
>>>> collecting more data and tuning the hyper parameters.
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/1edg1fdKCYR7-B1oOLy2kon179GSs6x2Zx9oSRDn_NEU/edit
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>>
>>
>>
>
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
