Re: [Dev] Fwd: GSOC2016: Proposal 6: [ML]

Maheshakya Wijewardena Fri, 25 Mar 2016 05:18:48 -0700

Hi Mahesh,

Thank you for sending the draft. Please submit it as soon as possible.


A few high-level comments:

In the proposal, you must specifically mention that this will be
implemented as a Siddhi extension that can operate directly on incoming
streams.

Also, you need to include a timeline for the project. A sample looks like this:

May 1- May 20 - Community bonding period - Getting familiar with the
platform and discussing implementation methods.
May 20 - May 30 - Implementing streaming k-means,
-----
-----
July 20-24 - Writing examples
July 24-28 - Documentation

This should end before the pencils-down date. Refer to the official timeline
given on the GSoC site.

The implementation details of the streaming algorithms look fine.

Best regards.


On Fri, Mar 25, 2016 at 5:23 PM, Mahesh Dananjaya <[email protected]
> wrote:

> Hi Maheshakya,
> This is my draft proposal:
>
> https://docs.google.com/document/d/1apZfEXZXEH5GwSwS7hARINbGw5_zinxWdZjEmyqfKu4/edit?usp=sharing
>
> Can you please check it and see whether it is correct? Thank you.
> BR,
> Mahesh
>
>
> On Mon, Mar 21, 2016 at 1:15 PM, Maheshakya Wijewardena <
> [email protected]> wrote:
>
>> Hi Mahesh,
>>
>> The deadline for submitting your proposals is on March 25th, 2016,
>> therefore please start writing the proposal and get feedback.
>>
>> Best regards.
>>
>> On Tue, Mar 15, 2016 at 4:14 PM, Mahesh Dananjaya <
>> [email protected]> wrote:
>>
>>> Hi Maheshakya,
>>> OK. I have been trying some examples, splitting the data and training
>>> incrementally. I am still working on that, and I have been adding the
>>> examples to my GitHub repo: https://github.com/dananjayamahesh/GSOC2016 .
>>> I saw that Spark has only Scala API support for those streaming
>>> algorithms, so my task is to develop a Java API. I will let you know my
>>> progress. Thank you very much.
>>> BR,
>>> Mahesh
>>>
>>> On Tue, Mar 15, 2016 at 3:21 PM, Maheshakya Wijewardena <
>>> [email protected]> wrote:
>>>
>>>> Hi Mahesh,
>>>>
>>>> No, you don't need to use Hadoop at any stage in this project.
>>>> Everything you need (regarding ML algorithms) is in Spark.
>>>> You can also use Spark MLLib's methods to randomly split data sets.
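
A seeded random split of a data set can be sketched in plain Java as follows. This is only an illustration of the idea and not Spark code: the class name RandomSplit and its split method are hypothetical, and the sketch simply shuffles a list and cuts it in proportion to the given weights.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration only; not part of Spark MLLib. */
final class RandomSplit {
    /**
     * Shuffles a copy of the data with a fixed seed, then cuts it into
     * consecutive pieces whose sizes are proportional to the weights.
     */
    static <T> List<List<T>> split(List<T> data, double[] weights, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));

        double total = 0.0;
        for (double w : weights) total += w;

        List<List<T>> pieces = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            // The last piece takes whatever remains, so no element is dropped.
            int end = (i == weights.length - 1)
                    ? shuffled.size()
                    : (int) Math.round(shuffled.size() * cumulative / total);
            pieces.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) data.add(i);
        System.out.println(split(data, new double[]{0.5, 0.3, 0.2}, 42L));
    }
}
```

Using the same seed gives the same pieces on every run, which makes it easier to compare repeated training experiments.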
>>>>
>>>> Best regards.
>>>>
>>>> On Mon, Mar 14, 2016 at 1:28 PM, Mahesh Dananjaya <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Maheshakya,
>>>>> I am writing some Java programs that break the data set into several
>>>>> pieces and train a model repeatedly with those pieces using Spark MLLib.
>>>>> Do I have to do anything with Hadoop at this stage? I am working in
>>>>> standalone mode. Thank you.
>>>>> BR,
>>>>> Mahesh.
>>>>>
>>>>> On Sun, Mar 13, 2016 at 6:30 PM, Maheshakya Wijewardena <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Mahesh,
>>>>>>
>>>>>> You don't have to look into carbon-ml.
>>>>>>
>>>>>> Best regards.
>>>>>>
>>>>>> On Sun, Mar 13, 2016 at 5:49 PM, Mahesh Dananjaya <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Maheshakya,
>>>>>>> I am working on some examples related to Spark and ML. Is there
>>>>>>> anything I need to do with carbon-ml? I think I don't need to look into
>>>>>>> that one, do I?
>>>>>>> BR,
>>>>>>> Mahesh
>>>>>>>
>>>>>>> On Tue, Mar 8, 2016 at 11:55 AM, Maheshakya Wijewardena <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Mahesh,
>>>>>>>>
>>>>>>>> Is that Scala API part of your current product or repo?
>>>>>>>>
>>>>>>>>
>>>>>>>> No, we don't have the Scala API included. What we want is to design
>>>>>>>> Java implementations of those algorithms that train with mini-batches
>>>>>>>> of streaming data, with the help of the aforementioned methods, so that
>>>>>>>> we can include them as a CEP extension.
>>>>>>>>
>>>>>>>> To clarify, please try to write a simple Java program using Spark
>>>>>>>> MLLib linear regression and k-means clustering with a sample data set
>>>>>>>> (you can find a lot of data sets in the UCI repo[1]). You need to break
>>>>>>>> the data set into several pieces and train a model repeatedly with those.
>>>>>>>> After each training run, save the model information (such as weights
>>>>>>>> and intercepts for regression, and cluster centers for clustering;
>>>>>>>> please check the arguments of the methods I mentioned and save the
>>>>>>>> information they require).
>>>>>>>> When training a model with a new piece of data, use those methods to
>>>>>>>> initialize it, passing the saved values as the arguments. This way you
>>>>>>>> can start from where you stopped in the previous run.
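
The train, save, resume loop described here can be sketched without Spark as follows. This is a minimal plain-Java illustration, not MLLib code: the class WarmStartRegression, its fields, the step size and the iteration count are all hypothetical choices, and it runs ordinary gradient descent over each piece of data.

```java
import java.util.Arrays;

/** Hypothetical plain-Java illustration of warm-started training; not Spark MLLib. */
final class WarmStartRegression {
    double[] weights;  // saved model state: one weight per feature
    double intercept;  // saved model state: bias term

    WarmStartRegression(double[] initialWeights, double initialIntercept) {
        this.weights = initialWeights.clone();
        this.intercept = initialIntercept;
    }

    double predict(double[] features) {
        double sum = intercept;
        for (int j = 0; j < weights.length; j++) sum += weights[j] * features[j];
        return sum;
    }

    /** Runs gradient descent on one piece of data, starting from the current state. */
    void trainOnBatch(double[][] x, double[] y, double stepSize, int iterations) {
        int n = x.length, d = weights.length;
        for (int it = 0; it < iterations; it++) {
            double[] gradW = new double[d];
            double gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double err = predict(x[i]) - y[i];
                for (int j = 0; j < d; j++) gradW[j] += err * x[i][j];
                gradB += err;
            }
            // Step along the averaged squared-error gradient.
            for (int j = 0; j < d; j++) weights[j] -= stepSize * gradW[j] / n;
            intercept -= stepSize * gradB / n;
        }
    }

    public static void main(String[] args) {
        // Two pieces of a data set generated from y = 2x + 1.
        double[][] x1 = {{0}, {1}, {2}};
        double[] y1 = {1, 3, 5};
        double[][] x2 = {{3}, {4}, {5}};
        double[] y2 = {7, 9, 11};

        WarmStartRegression model = new WarmStartRegression(new double[]{0.0}, 0.0);
        model.trainOnBatch(x1, y1, 0.05, 2000);

        // "Save" the state, then resume from it when the next piece arrives.
        double[] savedWeights = model.weights.clone();
        double savedIntercept = model.intercept;
        WarmStartRegression resumed = new WarmStartRegression(savedWeights, savedIntercept);
        resumed.trainOnBatch(x2, y2, 0.05, 2000);

        System.out.println(Arrays.toString(resumed.weights) + " " + resumed.intercept);
    }
}
```

The point of the sketch is the constructor: the second model starts from the saved weights and intercept instead of zeros, so the second piece refines the earlier result rather than replacing it.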
>>>>>>>>
>>>>>>>> Let us know your observations and feel free to ask if you need to
>>>>>>>> know anything more on this.
>>>>>>>>
>>>>>>>> We'll let you know what needs to be done to include this in CEP.
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>>
>>>>>>>> On Tue, Mar 8, 2016 at 10:59 AM, Mahesh Dananjaya <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Maheshakya,
>>>>>>>>> Great, thank you. I already have ML and CEP and am working further with
>>>>>>>>> them. Is that Scala API part of your current product or repo? Thank you.
>>>>>>>>> BR,
>>>>>>>>> Mahesh.
>>>>>>>>>
>>>>>>>>> On Sun, Mar 6, 2016 at 5:49 PM, Maheshakya Wijewardena <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mahesh,
>>>>>>>>>>
>>>>>>>>>> Please find the comments inline.
>>>>>>>>>>
>>>>>>>>>>> Is the data stream taken to ML in the event publisher's format,
>>>>>>>>>>> through an event publisher? Or can we use the direct traffic that
>>>>>>>>>>> comes to the event receiver, or else streams?
>>>>>>>>>>>
>>>>>>>>>> We intend to use the direct data as event streams.
>>>>>>>>>>
>>>>>>>>>>> 1.) Is the data coming from WSO2 DAS to ML coming as streams?
>>>>>>>>>>>
>>>>>>>>>> No, WSO2 ML doesn't use any event streams. The data stored in
>>>>>>>>>> tables in DAS is loaded into ML.
>>>>>>>>>>
>>>>>>>>>>> 2.) Are there any incremental learning algorithms currently active
>>>>>>>>>>> in ML? You mentioned that there are, and that they come with a Scala
>>>>>>>>>>> API, so there is streaming support with that Scala API. In that API,
>>>>>>>>>>> in which format is the data acquired by ML?
>>>>>>>>>>>
>>>>>>>>>> No, there are no incremental learning algorithms in ML. The Scala
>>>>>>>>>> API is about Spark MLLib. MLLib supports streaming k-means and
>>>>>>>>>> generalized linear models (linear regression variants and logistic
>>>>>>>>>> regression) with a Scala API. What those implementations basically
>>>>>>>>>> do is retrain the trained models with mini-batches as data arrives
>>>>>>>>>> sequentially. There, the breaking of streaming data into mini-batches
>>>>>>>>>> is done with the help of Spark Streaming. But we do not intend to use
>>>>>>>>>> Spark Streaming in our implementation. What we need to do is
>>>>>>>>>> implement similar behavior for event streams using the Java API. The
>>>>>>>>>> Java API has the following methods:
>>>>>>>>>>
>>>>>>>>>>    - createModel(Vector weights, double intercept) - for GLMs
>>>>>>>>>>      <http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html#createModel%28org.apache.spark.mllib.linalg.Vector,%20double%29>
>>>>>>>>>>      (Vector: <http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html>)
>>>>>>>>>>    - setInitialModel(KMeansModel model) - for k-means
>>>>>>>>>>      <http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitialModel%28org.apache.spark.mllib.clustering.KMeansModel%29>
>>>>>>>>>>      (KMeansModel: <http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html>)
>>>>>>>>>>
>>>>>>>>>> With the help of these methods, we can train models again with
>>>>>>>>>> newly arriving data while keeping the characteristics learned from
>>>>>>>>>> the previous data. When implementing this, we need to pay attention
>>>>>>>>>> to other parameters of incremental learning, such as data horizon and
>>>>>>>>>> data obsolescence (indicated in the project ideas page).
>>>>>>>>>> We need to discuss how to integrate these with CEP event streams. I
>>>>>>>>>> have added Suho to the thread for more clarification.
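
One way to picture retraining k-means on new data while letting old data fade is a decayed centroid update, sketched below in plain Java. This is an assumption-laden illustration, not MLLib code: the class DecayedKMeans and its decay parameter are hypothetical, and treating data obsolescence as a single decay factor is only one possible interpretation of that parameter.

```java
import java.util.Arrays;

/** Hypothetical plain-Java sketch of a decayed centroid update; not MLLib code. */
final class DecayedKMeans {
    final double[][] centers; // current cluster centers
    final double[] counts;    // effective number of points behind each center
    final double decay;       // 1.0 keeps all history, 0.0 keeps only the newest batch

    DecayedKMeans(double[][] initialCenters, double decay) {
        this.centers = new double[initialCenters.length][];
        for (int c = 0; c < initialCenters.length; c++) {
            this.centers[c] = initialCenters[c].clone();
        }
        // Counts start at zero, so the initial centers only guide the first assignment.
        this.counts = new double[initialCenters.length];
        this.decay = decay;
    }

    /** Index of the center closest to the point (squared Euclidean distance). */
    int nearest(double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Folds one mini-batch into the model, discounting older data by the decay factor. */
    void update(double[][] batch) {
        int k = centers.length, d = centers[0].length;
        double[][] sums = new double[k][d];
        int[] added = new int[k];
        for (double[] point : batch) {
            int c = nearest(point);
            added[c]++;
            for (int j = 0; j < d; j++) sums[c][j] += point[j];
        }
        for (int c = 0; c < k; c++) {
            double oldWeight = counts[c] * decay;
            counts[c] = oldWeight + added[c];
            if (added[c] == 0) continue; // no new points: center stays, its weight still decays
            for (int j = 0; j < d; j++) {
                centers[c][j] = (centers[c][j] * oldWeight + sums[c][j]) / counts[c];
            }
        }
    }

    public static void main(String[] args) {
        DecayedKMeans model = new DecayedKMeans(new double[][]{{0.0}, {10.0}}, 0.5);
        model.update(new double[][]{{1.0}, {3.0}, {9.0}, {11.0}});
        model.update(new double[][]{{4.0}, {4.0}});
        System.out.println(Arrays.deepToString(model.centers));
    }
}
```

With a decay of 1.0 every past point counts as much as a new one; with 0.0 each center is simply the mean of the newest batch assigned to it.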
>>>>>>>>>>
>>>>>>>>>> Best regards.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 5, 2016 at 5:15 PM, Mahesh Dananjaya <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Maheshakya,
>>>>>>>>>>> Since we plan to use WSO2 CEP to handle streaming data and
>>>>>>>>>>> implement the machine learning algorithms with Spark MLLib, is the
>>>>>>>>>>> data stream taken to ML in the event publisher's format, through an
>>>>>>>>>>> event publisher? Or can we use the direct traffic that comes to the
>>>>>>>>>>> event receiver, or else streams? I am referring to
>>>>>>>>>>> https://docs.wso2.com/display/CEP410/User+Guide
>>>>>>>>>>>     1.) Is the data coming from WSO2 DAS to ML coming as streams?
>>>>>>>>>>>     2.) Are there any incremental learning algorithms currently
>>>>>>>>>>> active in ML? You mentioned that there are, and that they come with
>>>>>>>>>>> a Scala API, so there is streaming support with that Scala API. In
>>>>>>>>>>> that API, in which format is the data acquired by ML?
>>>>>>>>>>>
>>>>>>>>>>> thank you.
>>>>>>>>>>> BR,
>>>>>>>>>>> Mahesh.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 4, 2016 at 2:03 PM, Maheshakya Wijewardena <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Mahesh,
>>>>>>>>>>>>
>>>>>>>>>>>> We had to modify the project scope a little to best suit the
>>>>>>>>>>>> requirements. We will update the project idea with those changes
>>>>>>>>>>>> soon and let you know.
>>>>>>>>>>>>
>>>>>>>>>>>> We do not support streaming data in WSO2 Machine Learner at the
>>>>>>>>>>>> moment. The new plan is to use WSO2 CEP to handle streaming data and
>>>>>>>>>>>> implement the machine learning algorithms with Spark MLLib. You can
>>>>>>>>>>>> look at the streaming k-means and streaming linear regression
>>>>>>>>>>>> implementations in MLLib. Currently, that API is only for Scala. Our
>>>>>>>>>>>> need is to get the Java APIs of k-means and generalized linear models
>>>>>>>>>>>> to support incremental learning with streaming data. This has to be
>>>>>>>>>>>> done as mini-batch learning, since these algorithms operate as
>>>>>>>>>>>> stochastic gradient descent, so that any learning with new data can
>>>>>>>>>>>> be done on top of the previously learned models.
>>>>>>>>>>>> So please go through those APIs[1][2][3] and try to get an idea.
>>>>>>>>>>>> Also, please try to understand how event streams work in WSO2
>>>>>>>>>>>> CEP[4][5].
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html
>>>>>>>>>>>> [2]
>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeans.html
>>>>>>>>>>>> [3]
>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/classification/LogisticRegressionWithSGD.html
>>>>>>>>>>>> [4]
>>>>>>>>>>>> https://docs.wso2.com/display/CEP310/Working+with+Event+Streams
>>>>>>>>>>>> [5]
>>>>>>>>>>>> https://docs.wso2.com/display/CEP310/Working+with+Execution+Plans
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 4, 2016 at 11:26 AM, Mahesh Dananjaya <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Maheshakya,
>>>>>>>>>>>>> Give me some time to go through your ML package. Does the current
>>>>>>>>>>>>> product have any streaming data support? I did some university
>>>>>>>>>>>>> projects related to machine learning, covering regression, modelling,
>>>>>>>>>>>>> factor analysis, cluster analysis and classification problems
>>>>>>>>>>>>> (discriminant analysis) with SVMs (support vector machines), neural
>>>>>>>>>>>>> networks, LS classification and ML (maximum likelihood). Give me
>>>>>>>>>>>>> some time to see how the WSO2 architecture works; then I can come up
>>>>>>>>>>>>> with a good architecture. Thank you.
>>>>>>>>>>>>> BR,
>>>>>>>>>>>>> Mahesh.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 2, 2016 at 2:41 PM, Mahesh Dananjaya <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Maheshakya,
>>>>>>>>>>>>>> Thank you for the resources. I will go through these, and I am
>>>>>>>>>>>>>> looking forward to this proposed project. Thank you.
>>>>>>>>>>>>>> BR,
>>>>>>>>>>>>>> Mahesh.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 2, 2016 at 1:52 PM, Maheshakya Wijewardena <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Mahesh,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your interest in this project.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We would like to know what type of similar projects you have
>>>>>>>>>>>>>>> worked on. You may have seen that WSO2 Machine Learner supports
>>>>>>>>>>>>>>> several learning algorithms at the moment[1]. This project intends
>>>>>>>>>>>>>>> to leverage the existing algorithms in WSO2 Machine Learner to
>>>>>>>>>>>>>>> support streaming data. As a first step, you can get an idea of
>>>>>>>>>>>>>>> what WSO2 Machine Learner does and how it operates. You can
>>>>>>>>>>>>>>> download WSO2 Machine Learner from the product page[2] and get the
>>>>>>>>>>>>>>> source code[3]. ML uses Apache Spark MLLib[4] for its algorithms,
>>>>>>>>>>>>>>> so it's better to read and understand what it does as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In order to get an idea about the deliverables and the scope
>>>>>>>>>>>>>>> of this project, try to understand how Spark Streaming[5] (see the
>>>>>>>>>>>>>>> examples) handles streaming data. Also, have a look at the
>>>>>>>>>>>>>>> streaming algorithms[6][7] supported by MLLib. The project
>>>>>>>>>>>>>>> proposals page discusses two approaches to employing incremental
>>>>>>>>>>>>>>> learning in ML. These streaming algorithms can be used directly in
>>>>>>>>>>>>>>> the first approach. For the other approach, your implementation
>>>>>>>>>>>>>>> should contain a procedure to create mini-batches of relevant
>>>>>>>>>>>>>>> sizes from streaming data (i.e. a moving window) and do periodic
>>>>>>>>>>>>>>> retraining of the same algorithm.
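
The moving-window idea can be sketched in plain Java as a small buffer that hands a mini-batch to a callback every few events. This is an illustration under stated assumptions, not Spark Streaming or CEP code: the class SlidingBatcher, its windowSize and slide parameters, and the callback are hypothetical names, and the callback is where periodic retraining would be triggered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sliding-window mini-batcher for illustration only. */
final class SlidingBatcher<T> {
    private final int windowSize;            // number of events in each mini-batch
    private final int slide;                 // number of new events between two batches
    private final Consumer<List<T>> onBatch; // e.g. a retraining step
    private final Deque<T> window = new ArrayDeque<>();
    private int sinceLastEmit = 0;

    SlidingBatcher(int windowSize, int slide, Consumer<List<T>> onBatch) {
        if (windowSize <= 0 || slide <= 0 || slide > windowSize) {
            throw new IllegalArgumentException("require 0 < slide <= windowSize");
        }
        this.windowSize = windowSize;
        this.slide = slide;
        this.onBatch = onBatch;
    }

    /** Adds one event; emits a copy of the window once it is full and has slid far enough. */
    void onEvent(T event) {
        window.addLast(event);
        if (window.size() > windowSize) {
            window.removeFirst(); // drop the oldest event
        }
        sinceLastEmit++;
        if (window.size() == windowSize && sinceLastEmit >= slide) {
            onBatch.accept(new ArrayList<>(window));
            sinceLastEmit = 0;
        }
    }

    public static void main(String[] args) {
        SlidingBatcher<Integer> batcher =
                new SlidingBatcher<Integer>(3, 2, batch -> System.out.println(batch));
        for (int i = 1; i <= 7; i++) {
            batcher.onEvent(i);
        }
    }
}
```

Setting slide equal to windowSize gives non-overlapping batches; a smaller slide makes consecutive batches share events, which retrains more often on partly repeated data.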
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To start with the project, you will need to come up with a
>>>>>>>>>>>>>>> suitable plan and an architecture first.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please watch the video referenced in the proposal
>>>>>>>>>>>>>>> (reference: 5). It will help you get a better idea about machine
>>>>>>>>>>>>>>> learning algorithms with streaming data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let us know if you need any help with these.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://docs.wso2.com/display/ML110/Machine+Learner+Algorithms
>>>>>>>>>>>>>>> [2] http://wso2.com/products/machine-learner/
>>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>> https://docs.wso2.com/display/ML110/Building+from+Source#BuildingfromSource-Downloadingthesourcecheckout
>>>>>>>>>>>>>>> [4] https://spark.apache.org/docs/1.4.1/mllib-guide.html
>>>>>>>>>>>>>>> [5]
>>>>>>>>>>>>>>> https://spark.apache.org/docs/1.4.1/streaming-programming-guide.html
>>>>>>>>>>>>>>> [6]
>>>>>>>>>>>>>>> https://spark.apache.org/docs/1.4.1/mllib-linear-methods.html#streaming-linear-regression
>>>>>>>>>>>>>>> [7]
>>>>>>>>>>>>>>> https://spark.apache.org/docs/1.4.1/mllib-clustering.html#streaming-k-means
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 2, 2016 at 1:19 PM, Mahesh Dananjaya <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>> I am interested in contributing to proposal 6: "Predictive
>>>>>>>>>>>>>>>> analytic with online data for WSO2 Machine Learner" for GSOC2 this
>>>>>>>>>>>>>>>> time. Since I have been engaged with some similar projects, I think
>>>>>>>>>>>>>>>> it will be a great experience for me. Please let me know what you
>>>>>>>>>>>>>>>> think and what you suggest. I have been going through your
>>>>>>>>>>>>>>>> documents. Thank you.
>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>> Mahesh Dananjaya.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> Dev mailing list
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Pruthuvi Maheshakya Wijewardena
>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>> +94711228855


-- 
Pruthuvi Maheshakya Wijewardena
[email protected]
+94711228855
