Re: [Architecture] Adding RNN to WSO2 Machine Learner

Imesh Gunaratne Thu, 21 Apr 2016 11:36:01 -0700

Hi Thamali,

One other point: people outside WSO2 might not be able to access the Google
Docs you have shared in this thread. You might need to export them as PDFs
and share those instead.


Thanks

On Thu, Apr 21, 2016 at 11:53 PM, Imesh Gunaratne <[email protected]> wrote:

> Hi Thamali,
>
> It might be better if you can share the artifacts you used to execute
> these tests in a public location, perhaps including a README.md file with
> the steps to follow.
>
> Thanks
>
> On Thu, Apr 21, 2016 at 6:03 PM, Thamali Wijewardhana <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have finished writing the article[1] comparing the deeplearning4j
>> library and the Keras library for the Recurrent Neural Network (RNN)
>> algorithm.
>> I have also identified the reasons for the low performance of the
>> deeplearning4j library using Java Flight Recorder (JFR) and Flame Graphs,
>> and included them in the article.
>>
>> [1]
>> https://docs.google.com/a/wso2.com/document/d/1CGq1y5QBzW6EaHyf-UqAiatxLumb6lo_mRLjYZWD18o/edit?usp=sharing
>>
>> Thanks
>>
>>
>> On Fri, Apr 8, 2016 at 7:20 PM, Thamali Wijewardhana <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> The dataset I used has 25,000 rows and is 80 MB in size.
>>>
>>> The link to the dataset is:
>>>
>>> http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
>>>
>>>
>>>
>>>
>>> On Fri, Apr 8, 2016 at 3:07 PM, Srinath Perera <[email protected]> wrote:
>>>
>>>> Thamali, how big is the data set you are using? (Please give me a link
>>>> to the data set as well.)
>>>>
>>>> Nirmal, shall we compare the accuracy of RNN vs. Upul's rolling window
>>>> method?
>>>>
>>>> --Srinath
>>>>
>>>> On Fri, Apr 8, 2016 at 9:23 AM, Thamali Wijewardhana <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I ran the RNN algorithm using the deeplearning4j library and the Keras
>>>>> Python library. The dataset, hyperparameters, network architecture, and
>>>>> hardware platform were the same in both cases. The time comparison is
>>>>> given below:
>>>>>
>>>>> Deeplearning4j library: 40 minutes per epoch
>>>>> Keras library: 4 minutes per epoch
>>>>>
>>>>> I also compared the accuracies[1]. The deeplearning4j library gives
>>>>> lower accuracy than the Keras library.
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/spreadsheets/d/1-EvC1P7N90k1S_Ly6xVcFlEEKprh7r41Yk8aI6DiSaw/edit#gid=1050346562
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 10:12 AM, Thamali Wijewardhana <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I have scheduled a review for Monday (4th of April).
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Thu, Mar 31, 2016 at 3:21 PM, Srinath Perera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Please set up a review. Shall we do it on Monday?
>>>>>>>
>>>>>>> On Thu, Mar 31, 2016 at 2:15 PM, Thamali Wijewardhana <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have created a Spark program to prove the feasibility of adding
>>>>>>>> the RNN algorithm to Machine Learner.
>>>>>>>> This program demonstrates all the steps in Machine Learner:
>>>>>>>>
>>>>>>>> Uploading a dataset
>>>>>>>>
>>>>>>>> Selecting the hyperparameters for the model
>>>>>>>>
>>>>>>>> Creating an RNN model using data and training the model
>>>>>>>>
>>>>>>>> Calculating the accuracy of the model
>>>>>>>>
>>>>>>>> Saving the model (as a serialized object)
>>>>>>>>
>>>>>>>> Predicting using the model
>>>>>>>>
>>>>>>>> This program is based on deeplearning4j and the Apache Spark pipeline.
>>>>>>>> Deeplearning4j was used as the deep learning library for the recurrent
>>>>>>>> neural network algorithm. As the program has to be based on the Spark
>>>>>>>> pipeline, the main challenge was to use the deeplearning4j library with
>>>>>>>> it. Every component used in a Spark pipeline must be compatible with
>>>>>>>> the pipeline API. Components that are not compatible have to be wrapped
>>>>>>>> in an org.apache.spark.ml.PredictionModel object.
>>>>>>>>
>>>>>>>> We have designed a pipeline with a sequence of stages (transformers
>>>>>>>> and estimators):
>>>>>>>>
>>>>>>>> 1. Tokenizer (Transformer): splits each sequence into tokens (for
>>>>>>>> example, in sentiment analysis, splits text into words).
>>>>>>>>
>>>>>>>> 2. Vectorizer (Transformer): transforms features into vectors.
>>>>>>>>
>>>>>>>> 3. RNN algorithm (Estimator): trains on a data frame and produces an
>>>>>>>> RNN model.
>>>>>>>>
>>>>>>>> 4. RNN model (Transformer): transforms a data frame with features into
>>>>>>>> a data frame with predictions.
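
The transformer/estimator chaining described in the four stages above can be illustrated with a minimal, library-free Python sketch. Everything here is an assumption made for illustration: the class names, the row-as-dict representation, and the MajorityClassEstimator (a trivial stand-in for the RNN estimator) are hypothetical and are not the Spark or deeplearning4j API. The point is only that an estimator's fit() yields a model, which is itself a transformer.

```python
from collections import Counter


class Tokenizer:
    """Transformer: splits the 'text' field of each row into lowercase tokens."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class Vectorizer:
    """Transformer: maps tokens to a count vector over a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, rows):
        out = []
        for row in rows:
            counts = Counter(row["tokens"])
            out.append(dict(row, features=[counts[w] for w in self.vocabulary]))
        return out


class MajorityClassModel:
    """Transformer produced by the estimator: adds a 'prediction' field."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityClassEstimator:
    """Estimator (hypothetical stand-in for the RNN): fit() returns a model."""

    def fit(self, rows):
        label = Counter(row["label"] for row in rows).most_common(1)[0][0]
        return MajorityClassModel(label)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; estimators are fitted and replaced by their models."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> model (a transformer)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


# Training usage: fit the whole pipeline on labelled rows.
train = [
    {"text": "great movie", "label": 1},
    {"text": "loved it", "label": 1},
    {"text": "terrible plot", "label": 0},
]
stages = [Tokenizer(), Vectorizer(["great", "terrible"]), MajorityClassEstimator()]
model = Pipeline(stages).fit(train)

# Prediction usage: the fitted pipeline only transforms.
predictions = model.transform([{"text": "Great great fun"}])
```

Training calls fit() on the pipeline, while testing and prediction only call transform() on the fitted result, which matches the two usages of the pipeline described in this mail.
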
>>>>>>>>
>>>>>>>> The diagrams below explain the stages of the pipeline. The first
>>>>>>>> diagram illustrates the training usage of the pipeline, and the second
>>>>>>>> illustrates its testing and prediction usage.
>>>>>>>>
>>>>>>>> [Diagram: training usage of the pipeline]
>>>>>>>>
>>>>>>>> [Diagram: testing and predicting usage of the pipeline]
>>>>>>>>
>>>>>>>> I have also tuned the hyperparameters of the RNN model[1] and found
>>>>>>>> the values that optimize the accuracy of the model.
>>>>>>>> Given below are the hyperparameters relevant to the RNN algorithm
>>>>>>>> and their tuned values:
>>>>>>>>
>>>>>>>>
>>>>>>>> Number of epochs: 10
>>>>>>>>
>>>>>>>> Number of iterations: 1
>>>>>>>>
>>>>>>>> Learning rate: 0.02
>>>>>>>>
>>>>>>>> We used the aclImdb sentiment analysis data set for this program,
>>>>>>>> and with the above hyperparameters we could achieve 60% accuracy. We
>>>>>>>> are now trying to improve the accuracy and efficiency of our algorithm.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.google.com/spreadsheets/d/1Wcta6i2k4Je_5l16wCVlH6zBMNGIb-d7USaWdbrkrSw/edit?ts=56fcdc9b#gid=2118685173
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 25, 2016 at 10:18 AM, Thamali Wijewardhana <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> One of the most important obstacles in machine learning and deep
>>>>>>>>> learning is getting data into a format that neural nets can
>>>>>>>>> understand. Neural nets understand vectors. Therefore, vectorization
>>>>>>>>> is an important part of building neural network algorithms.
>>>>>>>>>
>>>>>>>>> Canova is a vectorization library for machine learning that is
>>>>>>>>> associated with the deeplearning4j library. It is designed to support
>>>>>>>>> all major types of input data, such as text, CSV, image, audio, and
>>>>>>>>> video.
>>>>>>>>>
>>>>>>>>> In our project to add RNN to Machine Learner, we have to use a
>>>>>>>>> vectorizing component to convert input data to vectors. I think
>>>>>>>>> Canova is a better option for building a generic vectorizing
>>>>>>>>> component. I am researching how to use Canova for this purpose.
>>>>>>>>>
>>>>>>>>> Any suggestions on this are highly appreciated.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 2, 2016 at 2:25 PM, Thamali Wijewardhana <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Srinath,
>>>>>>>>>>
>>>>>>>>>> We have decided to implement only classification first. Once we
>>>>>>>>>> complete the classification, we hope to do next-value prediction too.
>>>>>>>>>> We are basically trying to implement a program to make sure that
>>>>>>>>>> the deeplearning4j library we are using is compatible with the Apache
>>>>>>>>>> Spark pipeline. We are also trying to demonstrate all the machine
>>>>>>>>>> learning steps with that program.
>>>>>>>>>>
>>>>>>>>>> We are now using the aclImdb sentiment analysis data set to verify
>>>>>>>>>> the accuracy of the RNN model we create.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Thamali
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 2, 2016 at 10:38 AM, Srinath Perera <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Thamali,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    1. An RNN can do both classification and next-value prediction.
>>>>>>>>>>>    Are we trying to do both?
>>>>>>>>>>>    2. When Upul played with it, he had trouble getting the
>>>>>>>>>>>    deeplearning4j implementation to work with the next-value
>>>>>>>>>>>    prediction scenario. Is it fixed?
>>>>>>>>>>>    3. What are the data sets we will use to verify the accuracy
>>>>>>>>>>>    of the RNN after integration?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --Srinath
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 1, 2016 at 3:44 PM, Thamali Wijewardhana <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we are working on a project to add the Recurrent Neural
>>>>>>>>>>>> Network (RNN) algorithm to Machine Learner. RNN is one of the deep
>>>>>>>>>>>> learning algorithms with record-breaking accuracy. For more
>>>>>>>>>>>> information on RNN, please refer to link [1].
>>>>>>>>>>>>
>>>>>>>>>>>> We have decided to use deeplearning4j, which is an open source
>>>>>>>>>>>> deep learning library scalable on Spark and Hadoop.
>>>>>>>>>>>>
>>>>>>>>>>>> Since there is a plan to add the Spark pipeline to Machine Learner,
>>>>>>>>>>>> we have decided to use the Spark pipeline concept in our project.
>>>>>>>>>>>>
>>>>>>>>>>>> I have designed an architecture for the RNN implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> This architecture is designed to be compatible with the Spark
>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>
>>>>>>>>>>>> The data set is taken in CSV format and then converted to a
>>>>>>>>>>>> Spark data frame, since Apache Spark works mostly with data frames.
>>>>>>>>>>>>
>>>>>>>>>>>> The next step is a transformer that tokenizes the sequential
>>>>>>>>>>>> data. A tokenizer takes a sequence of data and breaks it into
>>>>>>>>>>>> individual units. For example, it can be used to break a sentence
>>>>>>>>>>>> into words.
>>>>>>>>>>>>
>>>>>>>>>>>> The next step is again a transformer, used to convert tokens to
>>>>>>>>>>>> vectors. This must be done because the features should be added to
>>>>>>>>>>>> the Spark pipeline in org.apache.spark.mllib.linalg.VectorUDT format.
>>>>>>>>>>>>
>>>>>>>>>>>> Next, the transformed data are fed to the data set iterator.
>>>>>>>>>>>> This is an object of a class that implements
>>>>>>>>>>>> org.deeplearning4j.datasets.iterator.DataSetIterator. The data set
>>>>>>>>>>>> iterator traverses a data set and prepares data for neural networks.
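
To make the role of a data set iterator concrete, here is a minimal Python sketch of one common approach: walking the vectorized sequences in mini-batches and padding each batch to a common length, with a mask marking the real time steps. The function name iterate_batches, its parameters, and the padding/masking scheme are assumptions for illustration only; this is not the deeplearning4j DataSetIterator interface and not necessarily what the program in this thread does.

```python
def iterate_batches(sequences, labels, batch_size, pad_value=0):
    """Yield (padded_sequences, mask, batch_labels) mini-batches.

    sequences: list of variable-length lists of numbers (one per example)
    labels:    list of labels, same length as sequences
    mask:      1 where a time step holds real data, 0 where it is padding
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(sequences) != len(labels):
        raise ValueError("sequences and labels must have the same length")

    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        # Pad every sequence in the batch to the longest one in that batch.
        padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in batch]
        mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
        yield padded, mask, labels[start:start + batch_size]


# Usage: three tokenized-and-indexed reviews of different lengths.
batches = list(iterate_batches([[1, 2, 3], [4], [5, 6]], [1, 0, 1], batch_size=2))
```

Padding per batch rather than over the whole data set keeps short batches small; the mask lets a recurrent network ignore the padded steps.
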
>>>>>>>>>>>>
>>>>>>>>>>>> The next component is the RNN algorithm, which is an
>>>>>>>>>>>> estimator. The iterated data from the data set iterator is fed to
>>>>>>>>>>>> the RNN and a model is generated. This model can then be used for
>>>>>>>>>>>> predictions.
>>>>>>>>>>>>
>>>>>>>>>>>> We have decided to complete this project in two steps:
>>>>>>>>>>>>
>>>>>>>>>>>>    - First, create a Spark pipeline program containing the steps
>>>>>>>>>>>>    in Machine Learner (uploading a dataset, generating a model,
>>>>>>>>>>>>    calculating accuracy, and prediction) and check whether the
>>>>>>>>>>>>    project is feasible.
>>>>>>>>>>>>    - Next, add the algorithm to ML.
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we have almost completed the first step, and we are
>>>>>>>>>>>> now collecting more data and tuning the hyperparameters.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://docs.google.com/document/d/1edg1fdKCYR7-B1oOLy2kon179GSs6x2Zx9oSRDn_NEU/edit
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> ============================
>>>>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>>>>    http://people.apache.org/~hemapani/
>>>>>>>>>>>    http://srinathsview.blogspot.com/
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ============================
>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>> Site: http://home.apache.org/~hemapani/
>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>> Phone: 0772360902
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
> *Imesh Gunaratne*
> Senior Technical Lead
> WSO2 Inc: http://wso2.com
> T: +94 11 214 5345 M: +94 77 374 2057
> W: http://imesh.io
> Lean . Enterprise . Middleware
>
>


-- 
*Imesh Gunaratne*
Senior Technical Lead
WSO2 Inc: http://wso2.com
T: +94 11 214 5345 M: +94 77 374 2057
W: http://imesh.io
Lean . Enterprise . Middleware
