Re: [Architecture] Adding RNN to WSO2 Machine Learner

Thamali Wijewardhana Fri, 08 Apr 2016 06:51:18 -0700

Hi,

The dataset I used has 25000 rows and is 80 MB in size.


The link to the dataset is:

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
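
For anyone who wants to inspect the data, here is a minimal Python sketch of reading one split into (text, label) pairs. It assumes the archive extracts to folders such as aclImdb/train, each with `pos` and `neg` subfolders holding one review per text file; please check the README in the archive. The function name `load_reviews` is only illustrative, and the demo builds a tiny fake directory so it runs without the download.

```python
import os
import tempfile


def load_reviews(split_dir):
    """Read (text, label) pairs from `pos` (label 1) and `neg` (label 0) folders."""
    reviews = []
    for folder, label in (("pos", 1), ("neg", 0)):
        folder_path = os.path.join(split_dir, folder)
        for name in sorted(os.listdir(folder_path)):
            with open(os.path.join(folder_path, name), encoding="utf-8") as handle:
                reviews.append((handle.read(), label))
    return reviews


# Tiny fake directory in the assumed layout, so the sketch runs offline.
with tempfile.TemporaryDirectory() as root:
    for folder, text in (("pos", "a fine film"), ("neg", "a dull film")):
        os.makedirs(os.path.join(root, folder))
        path = os.path.join(root, folder, "0_1.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    demo_reviews = load_reviews(root)

print(demo_reviews)
```

With the real data, `load_reviews` would be pointed at the extracted split folder instead of the temporary directory.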




On Fri, Apr 8, 2016 at 3:07 PM, Srinath Perera <[email protected]> wrote:

> Thamali, how big is the data set you are using? (Give me a link to the
> data set as well.)
>
> Nirmal, shall we compare the accuracy of RNN vs. Upul's rolling window
> method?
>
> --Srinath
>
> On Fri, Apr 8, 2016 at 9:23 AM, Thamali Wijewardhana <[email protected]>
> wrote:
>
>> Hi,
>>
>> I ran the RNN algorithm using both the deeplearning4j library and the Keras
>> Python library. The dataset, hyperparameters, network architecture and
>> hardware platform were the same. The time comparison is given below:
>>
>> Deeplearning4j library: 40 minutes per epoch
>> Keras library: 4 minutes per epoch
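
To put the two timings side by side, the per-epoch figures can be turned into a speedup factor and a projected total training time. This is a small Python sketch; the epoch count of 10 is an assumption taken from the tuned hyperparameters quoted further down this thread, and the variable names are illustrative.

```python
# Per-epoch training times reported in this thread, in minutes.
minutes_per_epoch = {"deeplearning4j": 40, "keras": 4}

# Assumption: 10 epochs, the tuned value quoted later in this thread.
epochs = 10

# Ratio of the two per-epoch times.
speedup = minutes_per_epoch["deeplearning4j"] / minutes_per_epoch["keras"]

# Projected total training time for each library.
total_minutes = {lib: t * epochs for lib, t in minutes_per_epoch.items()}

print(f"Keras is {speedup:.0f}x faster per epoch")
for lib, total in total_minutes.items():
    print(f"{lib}: {total} minutes for {epochs} epochs")
```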
>>
>> I also compared the accuracies [1]. The deeplearning4j library gives lower
>> accuracy than the Keras library.
>>
>> [1]
>> https://docs.google.com/spreadsheets/d/1-EvC1P7N90k1S_Ly6xVcFlEEKprh7r41Yk8aI6DiSaw/edit#gid=1050346562
>>
>> Thanks
>>
>>
>>
>> On Fri, Apr 1, 2016 at 10:12 AM, Thamali Wijewardhana <[email protected]>
>> wrote:
>>
>>> Hi,
>>> I have organized a review on Monday (4th of April).
>>>
>>> Thanks
>>>
>>> On Thu, Mar 31, 2016 at 3:21 PM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> Please set up a review. Shall we do it on Monday?
>>>>
>>>> On Thu, Mar 31, 2016 at 2:15 PM, Thamali Wijewardhana <[email protected]
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have created a Spark program to prove the feasibility of adding the
>>>>> RNN algorithm to Machine Learner.
>>>>> This program demonstrates all the steps in Machine Learner:
>>>>>
>>>>> Uploading a dataset
>>>>>
>>>>> Selecting the hyperparameters for the model
>>>>>
>>>>> Creating an RNN model from the data and training the model
>>>>>
>>>>> Calculating the accuracy of the model
>>>>>
>>>>> Saving the model (as a serialized object)
>>>>>
>>>>> Predicting using the model
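
Two of the steps listed here, calculating accuracy and saving the model as a serialized object, can be sketched in plain Python. This is only an illustration and not the Spark/deeplearning4j code from the program: the model is a hypothetical threshold rule stored as a dict, and `pickle` stands in for Java serialization.

```python
import pickle


def predict(model, score):
    """Label 1 (positive) when the score reaches the threshold, else 0."""
    return 1 if score >= model["threshold"] else 0


def accuracy(model, samples):
    """Fraction of (score, label) samples the model labels correctly."""
    correct = sum(1 for score, label in samples if predict(model, score) == label)
    return correct / len(samples)


def save_and_reload(model):
    """Serialize the model to bytes and restore it (a save/load round trip)."""
    return pickle.loads(pickle.dumps(model))


# Hypothetical stand-in for a trained model; not an RNN.
model = {"threshold": 0.5}
samples = [(0.9, 1), (0.2, 0), (0.6, 0), (0.7, 1)]

restored = save_and_reload(model)
print(accuracy(model, samples))
print(predict(restored, 0.8))
```

The restored copy behaves like the original, which is the property the save and predict steps rely on.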
>>>>>
>>>>> This program is based on deeplearning4j and the Apache Spark pipeline.
>>>>> Deeplearning4j was used as the deep learning library for the recurrent
>>>>> neural network algorithm. As the program should be based on the Spark
>>>>> pipeline, the main challenge was to use the deeplearning4j library with
>>>>> it. Every component used in the pipeline must be compatible with the
>>>>> Spark pipeline. Components which are not compatible have to be wrapped in
>>>>> an org.apache.spark.ml.PredictionModel object.
>>>>>
>>>>> We have designed a pipeline with a sequence of stages (transformers and
>>>>> estimators):
>>>>>
>>>>> 1. Tokenizer (Transformer): splits each piece of sequential data into
>>>>> tokens. (For example, in sentiment analysis, it splits text into words.)
>>>>>
>>>>> 2. Vectorizer (Transformer): transforms features into vectors.
>>>>>
>>>>> 3. RNN algorithm (Estimator): trains on a data frame and produces an RNN
>>>>> model.
>>>>>
>>>>> 4. RNN model (Transformer): transforms a data frame with features into a
>>>>> data frame with predictions.
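
The transformer/estimator contract behind these four stages can be sketched without Spark. The plain-Python classes below are hypothetical and only mirror the shape of the design: transformers expose `transform`, and the estimator exposes `fit` and returns a model that is itself a transformer. `MajorityClassifier` is a trivial stand-in for the RNN algorithm, and rows are dicts rather than data frames.

```python
class Tokenizer:
    """Transformer: splits each row's text into lowercase word tokens."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class Vectorizer:
    """Transformer: maps tokens to counts over a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [
            dict(row, features=[row["tokens"].count(w) for w in self.vocabulary])
            for row in rows
        ]


class MajorityModel:
    """Transformer produced by fitting: adds a prediction to every row."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityClassifier:
    """Estimator (stand-in for the RNN): fit() returns a model."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return MajorityModel(max(set(labels), key=labels.count))


def fit_pipeline(stages, rows):
    """Run the stages in order; estimators are fitted and replaced by models."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted


def run_pipeline(fitted, rows):
    """Apply already-fitted stages to new rows (testing and prediction)."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [
    {"text": "great film", "label": 1},
    {"text": "bad plot", "label": 0},
    {"text": "great cast", "label": 1},
]
stages = [Tokenizer(), Vectorizer(["great", "bad"]), MajorityClassifier()]
fitted = fit_pipeline(stages, train)
result = run_pipeline(fitted, [{"text": "Bad acting"}])
print(result[0]["features"], result[0]["prediction"])
```

`fit_pipeline` corresponds to the training usage of the pipeline and `run_pipeline` to the testing and predicting usage.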
>>>>>
>>>>> The diagrams below explain the stages of the pipeline. The first
>>>>> diagram illustrates the training usage of the pipeline and the next
>>>>> diagram illustrates the testing and predicting usage of the pipeline.
>>>>>
>>>>> [Two pipeline diagrams (training; testing and prediction) are not
>>>>> preserved in this archive.]
>>>>>
>>>>> I have also tuned the hyperparameters of the RNN model [1] and found the
>>>>> values which optimize the accuracy of the model.
>>>>> Given below is the set of hyperparameters relevant to the RNN algorithm,
>>>>> with the tuned values.
>>>>>
>>>>>
>>>>> Number of epochs: 10
>>>>>
>>>>> Number of iterations: 1
>>>>>
>>>>> Learning rate: 0.02
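
One common way to arrive at values like these is an exhaustive grid search. The thread does not say which tuning method was actually used, so the Python sketch below is only an illustration. `toy_score` is a made-up scoring function constructed to peak at the tuned values; it stands in for "train the model and measure accuracy" and is not real measurement data.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid`; return the best (params, score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Toy stand-in for training and measuring accuracy; it is built to
    # peak at learning_rate=0.02 and epochs=10 and is not real data.
    return -abs(params["learning_rate"] - 0.02) - abs(params["epochs"] - 10) / 100


grid = {
    "epochs": [5, 10, 15],
    "iterations": [1],
    "learning_rate": [0.01, 0.02, 0.05],
}
best, _ = grid_search(grid, toy_score)
print(best)
```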
>>>>>
>>>>> We used the aclImdb sentiment analysis data set for this program and,
>>>>> with the above hyperparameters, we could achieve 60% accuracy. We are
>>>>> now trying to improve the accuracy and efficiency of our algorithm.
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/spreadsheets/d/1Wcta6i2k4Je_5l16wCVlH6zBMNGIb-d7USaWdbrkrSw/edit?ts=56fcdc9b#gid=2118685173
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 25, 2016 at 10:18 AM, Thamali Wijewardhana <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> One of the biggest obstacles in machine learning and deep learning is
>>>>>> getting data into a format that neural nets can understand. Neural nets
>>>>>> understand vectors. Therefore, vectorization is an important part of
>>>>>> building neural network algorithms.
>>>>>>
>>>>>> Canova is a vectorization library for machine learning which is
>>>>>> associated with the deeplearning4j library. It is designed to support
>>>>>> all major types of input data such as text, CSV, image, audio and video.
>>>>>>
>>>>>> In our project to add RNN to Machine Learner, we have to use a
>>>>>> vectorizing component to convert input data to vectors. I think that
>>>>>> Canova is a better choice for building a generic vectorizing component.
>>>>>> I am researching how to use Canova for this purpose.
>>>>>>
>>>>>> Any suggestions on this are highly appreciated.
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 2, 2016 at 2:25 PM, Thamali Wijewardhana <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Srinath,
>>>>>>>
>>>>>>> We have decided to implement only classification first. Once we
>>>>>>> complete the classification, we hope to do next value prediction too.
>>>>>>> We are basically trying to implement a program to make sure that the
>>>>>>> deeplearning4j library we are using is compatible with the Apache Spark
>>>>>>> pipeline. We are also trying to demonstrate all the machine learning
>>>>>>> steps with that program.
>>>>>>>
>>>>>>> We are now using the aclImdb sentiment analysis data set to verify the
>>>>>>> accuracy of the RNN model we create.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Thamali
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 2, 2016 at 10:38 AM, Srinath Perera <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Thamali,
>>>>>>>>
>>>>>>>>
>>>>>>>>    1. RNN can do both classification and next value prediction. Are
>>>>>>>>    we trying to do both?
>>>>>>>>    2. When Upul played with it, he had trouble getting the
>>>>>>>>    deeplearning4j implementation to work with the next value
>>>>>>>>    prediction scenario. Is it fixed?
>>>>>>>>    3. What are the data sets we will use to verify the accuracy of
>>>>>>>>    RNN after integration?
>>>>>>>>
>>>>>>>>
>>>>>>>> --Srinath
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2016 at 3:44 PM, Thamali Wijewardhana <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Currently we are working on a project to add the Recurrent Neural
>>>>>>>>> Network (RNN) algorithm to Machine Learner. RNN is one of the deep
>>>>>>>>> learning algorithms with record-breaking accuracy. For more
>>>>>>>>> information on RNN, please refer to link [1].
>>>>>>>>>
>>>>>>>>> We have decided to use deeplearning4j, which is an open source deep
>>>>>>>>> learning library scalable on Spark and Hadoop.
>>>>>>>>>
>>>>>>>>> Since there is a plan to add the Spark pipeline to Machine Learner, we
>>>>>>>>> have decided to apply the Spark pipeline concept in our project.
>>>>>>>>>
>>>>>>>>> I have designed an architecture for the RNN implementation.
>>>>>>>>>
>>>>>>>>> This architecture is designed to be compatible with the Spark
>>>>>>>>> pipeline.
>>>>>>>>>
>>>>>>>>> The data set is taken in CSV format and then converted to a Spark
>>>>>>>>> data frame, since Apache Spark works mostly with data frames.
>>>>>>>>>
>>>>>>>>> The next step is a transformer which tokenizes the sequential data.
>>>>>>>>> A tokenizer takes a sequence of data and breaks it into individual
>>>>>>>>> units. For example, it can be used to break a sentence into words.
>>>>>>>>>
>>>>>>>>> The next step is again a transformer, used to convert tokens to
>>>>>>>>> vectors. This must be done because the features should be added to
>>>>>>>>> the Spark pipeline in org.apache.spark.mllib.linalg.VectorUDT format.
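
As an illustration of the tokens-to-vectors step, here is one simple scheme in plain Python: build a vocabulary index, map each token to its integer id, and pad to a fixed length. This is an assumption chosen for clarity and is not necessarily the vectorization used in the program; it does not use Canova or Spark's VectorUDT, and the function names are hypothetical.

```python
def build_vocabulary(token_lists):
    """Assign each distinct token an integer id, starting at 1 (0 = padding)."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary) + 1)
    return vocabulary


def vectorize(tokens, vocabulary, length):
    """Map tokens to ids, drop unknown tokens, then truncate or pad to `length`."""
    ids = [vocabulary[t] for t in tokens if t in vocabulary][:length]
    return ids + [0] * (length - len(ids))


sentences = ["the film was good", "the plot was weak"]

# Tokenize: break each sentence into words.
token_lists = [sentence.split() for sentence in sentences]

vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary, length=5) for tokens in token_lists]
print(vocabulary)
print(vectors)
```

Fixed-length vectors matter here because every row fed to the later stages needs the same shape.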
>>>>>>>>>
>>>>>>>>> Next, the transformed data are fed to the data set iterator. This
>>>>>>>>> is an object of a class which implements
>>>>>>>>> org.deeplearning4j.datasets.iterator.DataSetIterator. The data set
>>>>>>>>> iterator traverses a data set and prepares data for neural networks.
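
The idea of a data set iterator can be sketched as a plain Python generator that walks the data in order and hands out mini-batches. This is only a conceptual analogue and not deeplearning4j's DataSetIterator interface; the function name and the sample data are hypothetical.

```python
def iterate_batches(features, labels, batch_size):
    """Yield (features, labels) mini-batches in order; the last may be smaller."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        yield features[start:end], labels[start:end]


# Made-up vectorized rows and their labels.
features = [[1, 2, 0], [3, 4, 5], [6, 0, 0], [7, 8, 0], [9, 1, 2]]
labels = [1, 0, 1, 1, 0]

batches = list(iterate_batches(features, labels, batch_size=2))
for batch_features, batch_labels in batches:
    print(batch_features, batch_labels)
```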
>>>>>>>>>
>>>>>>>>> The next component is the RNN algorithm, which is an estimator.
>>>>>>>>> The iterated data from the data set iterator are fed to the RNN and
>>>>>>>>> a model is generated. This model can then be used for predictions.
>>>>>>>>>
>>>>>>>>> We have decided to complete this project in two steps:
>>>>>>>>>
>>>>>>>>>    - First, create a Spark pipeline program containing the steps in
>>>>>>>>>    Machine Learner (uploading a dataset, generating a model,
>>>>>>>>>    calculating accuracy and prediction) and check whether the
>>>>>>>>>    project is feasible.
>>>>>>>>>    - Next, add the algorithm to ML.
>>>>>>>>>
>>>>>>>>> Currently we have almost completed the first step and now we are
>>>>>>>>> collecting more data and tuning the hyperparameters.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1edg1fdKCYR7-B1oOLy2kon179GSs6x2Zx9oSRDn_NEU/edit
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>    http://people.apache.org/~hemapani/
>>>>>>>>    http://srinathsview.blogspot.com/
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://home.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>>>
>>>
>>
>
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
> Site: http://home.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902
>

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
