Hi,

We have created a Spark program to prove the feasibility of adding the RNN algorithm to Machine Learner. The program demonstrates all the steps in Machine Learner:

- Uploading a dataset
- Selecting the hyperparameters for the model
- Creating an RNN model from the data and training it
- Calculating the accuracy of the model
- Saving the model (as a serialized object)
- Predicting using the model

The program is based on deeplearning4j and the Apache Spark pipeline. Deeplearning4j provides the recurrent neural network implementation. Because the program has to be built on the Spark pipeline, the main challenge was using the deeplearning4j library within it: every component used as a pipeline stage must be compatible with the Spark pipeline API. Components that are not compatible have to be wrapped in an org.apache.spark.ml.PredictionModel object.

We have designed a pipeline with a sequence of stages (transformers and estimators):

1. Tokenizer (Transformer): splits each item of sequential data into tokens. For example, in sentiment analysis, it splits text into words.
2. Vectorizer (Transformer): transforms features into vectors.
3. RNN algorithm (Estimator): trains on a data frame and produces an RNN model.
4. RNN model (Transformer): transforms a data frame with features into a data frame with predictions.

The diagrams below explain the stages of the pipeline. The first diagram illustrates the training usage of the pipeline and the second illustrates the testing and prediction usage.

I have also tuned the hyperparameters of the RNN model [1] and found the values that maximize the accuracy of the model. Given below are the hyperparameters relevant to the RNN algorithm and their tuned values:

Number of epochs: 10
Number of iterations: 1
Learning rate: 0.02

We used the aclImdb sentiment analysis dataset for this program and, with the above hyperparameters, achieved 60% accuracy. We are working on improving the accuracy and efficiency of the algorithm.

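To illustrate how the four stages fit together, here is a minimal sketch in plain Python. It is not the program described above and uses neither Spark nor deeplearning4j: all class and function names are hypothetical, the data is a made-up toy set, and a simple word-score classifier stands in for the RNN estimator. It only shows the transformer/estimator pattern, the accuracy calculation, and saving and restoring the fitted model parameters.

```python
import pickle
from collections import Counter


class Tokenizer:
    """Transformer: splits each text into lowercase word tokens."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class Vectorizer:
    """Transformer: turns tokens into word-count vectors over a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, rows):
        out = []
        for row in rows:
            counts = Counter(row["tokens"])
            out.append(dict(row, features=[counts[w] for w in self.vocabulary]))
        return out


class WordScoreModel:
    """Transformer returned by fit(): adds a 'prediction' field to each row."""

    def __init__(self, weights):
        self.weights = weights

    def transform(self, rows):
        out = []
        for row in rows:
            score = sum(w * x for w, x in zip(self.weights, row["features"]))
            out.append(dict(row, prediction=1 if score > 0 else 0))
        return out


class WordScoreEstimator:
    """Estimator: placeholder for the RNN stage (NOT an RNN). fit() returns a model."""

    def fit(self, rows):
        weights = [0] * len(rows[0]["features"])
        for row in rows:
            sign = 1 if row["label"] == 1 else -1
            for i, x in enumerate(row["features"]):
                weights[i] += sign * x
        return WordScoreModel(weights)


def run_stages(stages, rows):
    """Applies each transformer in order, as a fitted pipeline would."""
    for stage in stages:
        rows = stage.transform(rows)
    return rows


def accuracy(rows):
    """Fraction of rows whose prediction matches the label."""
    return sum(1 for r in rows if r["prediction"] == r["label"]) / len(rows)


# Toy data, invented for this sketch only.
train = [
    {"text": "great movie loved it", "label": 1},
    {"text": "wonderful acting great story", "label": 1},
    {"text": "terrible movie hated it", "label": 0},
    {"text": "boring story terrible acting", "label": 0},
]
test = [
    {"text": "loved the wonderful story", "label": 1},
    {"text": "boring and terrible", "label": 0},
]

vocabulary = sorted({w for row in train for w in row["text"].split()})
feature_stages = [Tokenizer(), Vectorizer(vocabulary)]

# Training usage: transformers prepare the data, the estimator produces a model.
model = WordScoreEstimator().fit(run_stages(feature_stages, train))

# Saving and restoring the fitted parameters as a serialized object.
restored = WordScoreModel(pickle.loads(pickle.dumps(model.weights)))

# Testing and prediction usage: the model is the last transformer in the chain.
scored = run_stages(feature_stages + [restored], test)
test_accuracy = accuracy(scored)
```

In a real Spark pipeline the rows would be a data frame and each stage would add a column to it. The point of the sketch is only that fitting the estimator yields a model which then behaves like any other transformer.
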
[1] https://docs.google.com/spreadsheets/d/1Wcta6i2k4Je_5l16wCVlH6zBMNGIb-d7USaWdbrkrSw/edit?ts=56fcdc9b#gid=2118685173

Thanks

On Fri, Mar 25, 2016 at 10:18 AM, Thamali Wijewardhana wrote:

> Hi all,
>
> One of the most important obstacles in machine learning and deep learning
> is getting data into a format that neural nets can understand. Neural nets
> understand vectors. Therefore, vectorization is an important part of
> building neural network algorithms.
>
> Canova is a vectorization library for machine learning which is associated
> with the deeplearning4j library. It is designed to support all major types
> of input data such as text, CSV, image, audio and video.
>
> In our project to add RNN to Machine Learner, we have to use a
> vectorizing component to convert input data to vectors. I think that Canova
> is a better option for building a generic vectorizing component. I am
> researching the use of Canova for this purpose.
>
> Any suggestions on this are highly appreciated.
>
> Thanks
>
> On Wed, Mar 2, 2016 at 2:25 PM, Thamali Wijewardhana wrote:
>
>> Hi Srinath,
>>
>> We have decided to implement only classification first. Once we complete
>> the classification, we hope to do next value prediction too.
>> We are basically trying to implement a program to make sure that the
>> deeplearning4j library we are using is compatible with the Apache Spark
>> pipeline. We are also trying to demonstrate all the machine learning
>> steps with that program.
>>
>> We are now using the aclImdb sentiment analysis data set to verify the
>> accuracy of the RNN model we create.
>>
>> Thanks
>> Thamali
>>
>> On Wed, Mar 2, 2016 at 10:38 AM, Srinath Perera wrote:
>>
>>> Hi Thamali,
>>>
>>> 1. RNN can do both classification and next value prediction. Are we
>>> trying to do both?
>>> 2. When Upul played with it, he had trouble getting the deeplearning4j
>>> implementation to work with the next value prediction scenario. Is it
>>> fixed?
>>> 3. What are the data sets we will use to verify the accuracy of RNN
>>> after integration?
>>>
>>> --Srinath
>>>
>>> On Tue, Mar 1, 2016 at 3:44 PM, Thamali Wijewardhana wrote:
>>>
>>>> Hi,
>>>>
>>>> Currently we are working on a project to add the Recurrent Neural
>>>> Network (RNN) algorithm to Machine Learner. RNN is one of the deep
>>>> learning algorithms with record-breaking accuracy. For more information
>>>> on RNN, please refer to link [1].
>>>>
>>>> We have decided to use deeplearning4j, which is an open source deep
>>>> learning library scalable on Spark and Hadoop.
>>>>
>>>> Since there is a plan to add the Spark pipeline to Machine Learner, we
>>>> have decided to apply the Spark pipeline concept in our project.
>>>>
>>>> I have designed an architecture for the RNN implementation.
>>>>
>>>> This architecture is developed to be compatible with the Spark pipeline.
>>>>
>>>> The data set is taken in CSV format and then converted to a Spark data
>>>> frame, since Apache Spark works mostly with data frames.
>>>>
>>>> The next step is a transformer which tokenizes the sequential data. A
>>>> tokenizer takes a sequence of data and breaks it into individual units.
>>>> For example, it can be used to break a sentence into words.
>>>>
>>>> The next step is again a transformer, used to convert tokens to vectors.
>>>> This must be done because the features should be added to the Spark
>>>> pipeline in org.apache.spark.mllib.linalg.VectorUDT format.
>>>>
>>>> Next, the transformed data are fed to the data set iterator. This is an
>>>> object of a class which implements
>>>> org.deeplearning4j.datasets.iterator.DataSetIterator. The data set
>>>> iterator traverses a data set and prepares data for neural networks.
>>>>
>>>> The next component is the RNN algorithm, which is an estimator. The
>>>> iterated data from the data set iterator is fed to the RNN and a model
>>>> is generated. This model can then be used for predictions.
>>>>
>>>> We have decided to complete this project in two steps:
>>>>
>>>> - First, create a Spark pipeline program containing the steps in
>>>>   Machine Learner (uploading the dataset, generating the model,
>>>>   calculating accuracy and prediction) and check whether the project
>>>>   is feasible.
>>>> - Next, add the algorithm to ML.
>>>>
>>>> Currently we have almost completed the first step and now we are
>>>> collecting more data and tuning the hyperparameters.
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/1edg1fdKCYR7-B1oOLy2kon179GSs6x2Zx9oSRDn_NEU/edit
>>>>
>>>
>>> --
>>> ============================
>>> Srinath Perera, Ph.D.
>>> http://people.apache.org/~hemapani/
>>> http://srinathsview.blogspot.com/
_______________________________________________
Architecture mailing list
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
