Hi, I used a dataset with 25,000 rows; its size is 80 MB.
The link to the dataset is:
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

On Fri, Apr 8, 2016 at 3:07 PM, Srinath Perera <[email protected]> wrote:

> Thamali, how big is the data set you are using? (Give me a link to the
> data set as well.)
>
> Nirmal, shall we compare the accuracy of RNN vs. Upul's rolling window
> method?
>
> --Srinath
>
> On Fri, Apr 8, 2016 at 9:23 AM, Thamali Wijewardhana <[email protected]>
> wrote:
>
>> Hi,
>>
>> I ran the RNN algorithm using the deeplearning4j library and the Keras
>> Python library. The dataset, hyperparameters, network architecture and
>> hardware platform were the same. The time comparison is given below:
>>
>> Deeplearning4j library: 40 minutes per epoch
>> Keras library: 4 minutes per epoch
>>
>> I also compared the accuracies [1]. The deeplearning4j library gives a
>> lower accuracy than the Keras library.
>>
>> [1]
>> https://docs.google.com/spreadsheets/d/1-EvC1P7N90k1S_Ly6xVcFlEEKprh7r41Yk8aI6DiSaw/edit#gid=1050346562
>>
>> Thanks
>>
>> On Fri, Apr 1, 2016 at 10:12 AM, Thamali Wijewardhana <[email protected]>
>> wrote:
>>
>>> Hi,
>>> I have organized a review on Monday (4th of April).
>>>
>>> Thanks
>>>
>>> On Thu, Mar 31, 2016 at 3:21 PM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> Please set up a review. Shall we do it Monday?
>>>>
>>>> On Thu, Mar 31, 2016 at 2:15 PM, Thamali Wijewardhana <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have created a Spark program to prove the feasibility of adding the
>>>>> RNN algorithm to Machine Learner.
>>>>> This program demonstrates all the steps in Machine Learner:
>>>>>
>>>>> - Uploading a dataset
>>>>> - Selecting the hyperparameters for the model
>>>>> - Creating an RNN model using data and training the model
>>>>> - Calculating the accuracy of the model
>>>>> - Saving the model (as a serialized object)
>>>>> - Predicting using the model
>>>>>
>>>>> This program is based on deeplearning4j and the Apache Spark pipeline.
>>>>> Deeplearning4j was used as the deep learning library for the recurrent
>>>>> neural network algorithm. As the program should be based on the Spark
>>>>> pipeline, the main challenge was to use the deeplearning4j library with
>>>>> it. The components used in the pipeline must be compatible with the
>>>>> Spark pipeline. Components which are not compatible have to be wrapped
>>>>> in an org.apache.spark.ml.PredictionModel object.
>>>>>
>>>>> We have designed a pipeline with a sequence of stages (transformers and
>>>>> estimators):
>>>>>
>>>>> 1. Tokenizer (Transformer): splits each item of sequential data into
>>>>>    tokens. (For example, in sentiment analysis, it splits text into
>>>>>    words.)
>>>>> 2. Vectorizer (Transformer): transforms features into vectors.
>>>>> 3. RNN algorithm (Estimator): trains on a data frame and produces an
>>>>>    RNN model.
>>>>> 4. RNN model (Transformer): transforms a data frame with features into
>>>>>    a data frame with predictions.
>>>>>
>>>>> The diagrams below explain the stages of the pipeline. The first
>>>>> diagram illustrates the training usage of the pipeline and the next
>>>>> diagram illustrates the testing and predicting usage of the pipeline.
>>>>>
>>>>> [Diagram: training usage of the pipeline]
>>>>>
>>>>> [Diagram: testing and predicting usage of the pipeline]
>>>>>
>>>>> I have also tuned the hyperparameters of the RNN model [1] and found the
>>>>> values which optimize the accuracy of the model.
>>>>> Given below is the set of hyperparameters relevant to the RNN algorithm
>>>>> and their tuned values:
>>>>>
>>>>> Number of epochs: 10
>>>>> Number of iterations: 1
>>>>> Learning rate: 0.02
>>>>>
>>>>> We used the aclImdb sentiment analysis data set for this program and,
>>>>> with the above hyperparameters, achieved 60% accuracy. We are trying to
>>>>> improve the accuracy and efficiency of our algorithm.
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/spreadsheets/d/1Wcta6i2k4Je_5l16wCVlH6zBMNGIb-d7USaWdbrkrSw/edit?ts=56fcdc9b#gid=2118685173
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Mar 25, 2016 at 10:18 AM, Thamali Wijewardhana <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> One of the most important obstacles in machine learning and deep
>>>>>> learning is getting data into a format that neural nets can understand.
>>>>>> Neural nets understand vectors. Therefore, vectorization is an important
>>>>>> part of building neural network algorithms.
>>>>>>
>>>>>> Canova is a vectorization library for machine learning which is
>>>>>> associated with the deeplearning4j library. It is designed to support
>>>>>> all major types of input data, such as text, CSV, image, audio and
>>>>>> video.
>>>>>>
>>>>>> In our project to add RNN to Machine Learner, we have to use a
>>>>>> vectorizing component to convert input data to vectors. I think that
>>>>>> Canova is a better choice for building a generic vectorizing component.
>>>>>> I am researching using Canova for this purpose.
>>>>>>
>>>>>> Any suggestions on this are highly appreciated.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Mar 2, 2016 at 2:25 PM, Thamali Wijewardhana <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Srinath,
>>>>>>>
>>>>>>> We have decided to implement only classification first. Once we
>>>>>>> complete the classification, we hope to do next-value prediction too.
>>>>>>> We are basically trying to implement a program to make sure that the
>>>>>>> deeplearning4j library we are using is compatible with the Apache Spark
>>>>>>> pipeline. We are also trying to demonstrate all the machine learning
>>>>>>> steps with that program.
>>>>>>>
>>>>>>> We are now using the aclImdb sentiment analysis data set to verify the
>>>>>>> accuracy of the RNN model we create.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Thamali
>>>>>>>
>>>>>>> On Wed, Mar 2, 2016 at 10:38 AM, Srinath Perera <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Thamali,
>>>>>>>>
>>>>>>>> 1. RNN can do both classification and next-value prediction. Are
>>>>>>>>    we trying to do both?
>>>>>>>> 2. When Upul played with it, he had trouble getting the
>>>>>>>>    deeplearning4j implementation to work with the next-value
>>>>>>>>    prediction scenario. Is it fixed?
>>>>>>>> 3. What data sets will we use to verify the accuracy of the
>>>>>>>>    RNN after integration?
>>>>>>>>
>>>>>>>> --Srinath
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2016 at 3:44 PM, Thamali Wijewardhana <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Currently we are working on a project to add the Recurrent Neural
>>>>>>>>> Network (RNN) algorithm to Machine Learner. RNN is one of the deep
>>>>>>>>> learning algorithms with record-breaking accuracy. For more
>>>>>>>>> information on RNN, please refer to link [1].
>>>>>>>>>
>>>>>>>>> We have decided to use deeplearning4j, which is an open source deep
>>>>>>>>> learning library scalable on Spark and Hadoop.
>>>>>>>>>
>>>>>>>>> Since there is a plan to add the Spark pipeline to Machine Learner,
>>>>>>>>> we have decided to apply the Spark pipeline concept to our project.
>>>>>>>>>
>>>>>>>>> I have designed an architecture for the RNN implementation.
>>>>>>>>>
>>>>>>>>> This architecture is developed to be compatible with the Spark
>>>>>>>>> pipeline.
>>>>>>>>>
>>>>>>>>> The data set is taken in CSV format and then converted to a Spark
>>>>>>>>> data frame, since Apache Spark works mostly with data frames.
>>>>>>>>>
>>>>>>>>> The next step is a transformer which tokenizes the sequential data.
>>>>>>>>> A tokenizer takes a sequence of data and breaks it into individual
>>>>>>>>> units. For example, it can be used to break a sentence into words.
>>>>>>>>>
>>>>>>>>> The next step is again a transformer, used to convert tokens to
>>>>>>>>> vectors. This must be done because the features should be added to
>>>>>>>>> the Spark pipeline in org.apache.spark.mllib.linalg.VectorUDT format.
>>>>>>>>>
>>>>>>>>> Next, the transformed data are fed to the data set iterator. This
>>>>>>>>> is an object of a class which implements
>>>>>>>>> org.deeplearning4j.datasets.iterator.DataSetIterator. The data set
>>>>>>>>> iterator traverses a data set and prepares data for neural networks.
>>>>>>>>>
>>>>>>>>> The next component is the RNN algorithm, which is an estimator.
>>>>>>>>> The iterated data from the data set iterator is fed to the RNN and a
>>>>>>>>> model is generated. This model can then be used for predictions.
>>>>>>>>>
>>>>>>>>> We have decided to complete this project in two steps:
>>>>>>>>>
>>>>>>>>> - First, create a Spark pipeline program containing the steps in
>>>>>>>>>   Machine Learner (uploading a dataset, generating a model,
>>>>>>>>>   calculating accuracy and prediction) and check whether the project
>>>>>>>>>   is feasible.
>>>>>>>>> - Next, add the algorithm to ML.
>>>>>>>>>
>>>>>>>>> Currently we have almost completed the first step, and now we are
>>>>>>>>> collecting more data and tuning the hyperparameters.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1edg1fdKCYR7-B1oOLy2kon179GSs6x2Zx9oSRDn_NEU/edit
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>> http://people.apache.org/~hemapani/
>>>>>>>> http://srinathsview.blogspot.com/
>>>>>>>>
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://home.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
> Site: http://home.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902
>
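
The stage sequence discussed in this thread (tokenizer, vectorizer, estimator, fitted model) can be illustrated with a small toy. This is only a sketch under loud assumptions: it is plain Python, not Spark or deeplearning4j code; every class name here is hypothetical; and a simple perceptron stands in for the RNN estimator purely to show how transformers and an estimator chain together. The epochs and learning rate reuse the tuned values quoted above.

```python
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'tokens' column by splitting text into words."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class Vectorizer:
    """Transformer: adds a 'features' column of word counts (fixed vocabulary)."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, rows):
        vectorized = []
        for row in rows:
            counts = Counter(row["tokens"])
            features = [counts[word] for word in self.vocabulary]
            vectorized.append(dict(row, features=features))
        return vectorized


class LinearModel:
    """Transformer returned by the estimator: adds a 'prediction' column."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def transform(self, rows):
        predicted = []
        for row in rows:
            score = self.bias + sum(w * x for w, x in zip(self.weights, row["features"]))
            predicted.append(dict(row, prediction=1 if score > 0 else 0))
        return predicted


class PerceptronEstimator:
    """Estimator: a perceptron used as a stand-in for the RNN (NOT an RNN)."""

    def __init__(self, epochs=10, learning_rate=0.02):
        self.epochs = epochs
        self.learning_rate = learning_rate

    def fit(self, rows):
        weights = [0.0] * len(rows[0]["features"])
        bias = 0.0
        for _ in range(self.epochs):
            for row in rows:
                score = bias + sum(w * x for w, x in zip(weights, row["features"]))
                error = row["label"] - (1 if score > 0 else 0)
                if error:  # update only on misclassified rows
                    weights = [w + self.learning_rate * error * x
                               for w, x in zip(weights, row["features"])]
                    bias += self.learning_rate * error
        return LinearModel(weights, bias)


class Pipeline:
    """Runs stages in order; estimators are fitted and replaced by their models."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def accuracy(rows):
    return sum(row["prediction"] == row["label"] for row in rows) / len(rows)


# Made-up training rows, for illustration only (not taken from aclImdb).
train = [
    {"text": "good great film", "label": 1},
    {"text": "bad awful film", "label": 0},
    {"text": "great acting", "label": 1},
    {"text": "awful plot", "label": 0},
]
pipeline = Pipeline([
    Tokenizer(),
    Vectorizer(["good", "great", "bad", "awful"]),
    PerceptronEstimator(epochs=10, learning_rate=0.02),
])
model = pipeline.fit(train)
print("training accuracy:", accuracy(model.transform(train)))
predictions = model.transform([{"text": "Great good story"}, {"text": "Awful bad"}])
print([row["prediction"] for row in predictions])
```

Fitting the pipeline corresponds to the training usage, and calling transform on the fitted pipeline corresponds to the testing and predicting usage. In the real design the estimator stage would wrap the deeplearning4j network rather than this stand-in.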
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
