Re: [Dev] [ml][cep][gsoc-6] Initial Project Meeting-Predictive Analysis with online data

2016-05-25 Thread Supun Sethunga
Hi Mahesh,

Before you start on the Siddhi extension, can you also do a small
evaluation of the rate at which each of those SGD algorithms trains a
model, and share the results? (i.e., how long it takes to train a model
for different sample sizes.)

AFAIU, the size of 'K' (the window length), or in the worst case whether
to train as streaming or as batch, is strictly determined by the above
results.
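One way to structure that evaluation is to time each trainer over increasing sample sizes. The sketch below uses a plain-Python SGD linear regression as a stand-in for the MLlib trainers; the `train_sgd` and `benchmark` names and the synthetic data are illustrative only, not part of any WSO2 or Spark API:

```python
import random
import time

def train_sgd(data, lr=0.01, epochs=5):
    """Stand-in SGD linear regression; data is a list of (x, y) pairs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def benchmark(sample_sizes):
    """Time training for each sample size; returns {size: seconds}."""
    rng = random.Random(42)
    timings = {}
    for n in sample_sizes:
        # Synthetic points around the line y = 2x + 1 with a little noise.
        data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1))
                for x in (rng.uniform(-1, 1) for _ in range(n))]
        start = time.perf_counter()
        train_sgd(data)
        timings[n] = time.perf_counter() - start
    return timings

timings = benchmark([1000, 10000, 100000])
```

Plotting seconds against sample size (and against K, for windowed training) would show directly where batch retraining stops being feasible.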

Regards,
Supun

On Wed, May 25, 2016 at 7:05 PM, Mahesh Dananjaya  wrote:

> Hi Maheshakya,
> This is the one I need to post. Please check.
>
> Today we had an initial project meeting with the WSO2 ML team. We
> discussed the architecture, the best approaches, and the scope of the
> entire project. As I understood it, there will be three main components
> in the design:
> 1. CEP Siddhi extension for the CEP streaming data interface
> 2. Core with sample data points and the most recently saved model
> 3. Apache Spark MLlib algorithm wrapper
>
> Since we need to facilitate streaming data support for CEP/Siddhi, what
> I need initially is to develop a Siddhi extension to get event streams
> into my program. As we discussed, the best approach is a CEP Siddhi
> Stream Processor extension that feeds CEP event streams into the
> program, which incrementally learns the model for predictions and
> analysis [1]. The extension will be the interface through which CEP
> sends data to the program; there can be different interfaces for
> different applications. The stream data from CEP is taken as batches
> rather than as single events. There will also be output from my program
> back to CEP: the recent model information and parameters such as the
> MSE, cluster centers, etc. Beyond that, there are two approaches we have
> not yet finalized:
>
> 1. Collect a K-size batch from the incoming data, learn the model with
> that mini-batch, and store the model. In this case the memory
> requirement depends on K and the number of features per event. This way
> we can also achieve streaming properties such as data obsolescence and
> data horizon: relevant data is kept while irrelevant data is removed
> from model training.
> 2. Collect the data into a large in-memory buffer and use a moving
> window of size K with shift n, where n is the number of points the
> window is moved after each calculation. In this case we need a large
> amount of memory.
>
> Another approach that was raised is to store the events/data points in a
> database and use them later. As we discussed, there are two approaches
> to sending the updated/learned model to the customer side: time based
> and size based. That is, there can be a time window (one day, one week,
> etc.) or a batch size (or both) in the K-size batch approach.
>
> The other component is the wrapper around the Spark MLlib SGD-based
> algorithms for incremental learning. As I realized, there will be memory
> and other constraints when we incrementally learn models from the stream
> data coming out of CEP, basically on the machine where CEP is deployed.
> Therefore we need to look into timing and performance when we run those
> algorithms on large data sets frequently over time. Initially, we plan
> to develop the extension for CEP/Siddhi to get the stream
> data/events/sample points; after that we can move on to the MLlib
> libraries. We now have three algorithms, linear regression, k-means
> clustering, and logistic regression, though we initially look only at
> the first two. So this week will be spent developing that extension.
> Thank you.
>
> regards,
> Mahesh.
>
> [1]
> https://docs.wso2.com/display/CEP400/Writing+a+Custom+Stream+Processor+Extension
>



-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
___
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev


[Dev] [ml][cep][gsoc-6] Initial Project Meeting-Predictive Analysis with online data

2016-05-25 Thread Mahesh Dananjaya
Hi Maheshakya,
This is the one I need to post. Please check.

Today we had an initial project meeting with the WSO2 ML team. We discussed
the architecture, the best approaches, and the scope of the entire project.
As I understood it, there will be three main components in the design:
1. CEP Siddhi extension for the CEP streaming data interface
2. Core with sample data points and the most recently saved model
3. Apache Spark MLlib algorithm wrapper

Since we need to facilitate streaming data support for CEP/Siddhi, what I
need initially is to develop a Siddhi extension to get event streams into
my program. As we discussed, the best approach is a CEP Siddhi Stream
Processor extension that feeds CEP event streams into the program, which
incrementally learns the model for predictions and analysis [1]. The
extension will be the interface through which CEP sends data to the
program; there can be different interfaces for different applications. The
stream data from CEP is taken as batches rather than as single events.
There will also be output from my program back to CEP: the recent model
information and parameters such as the MSE, cluster centers, etc. Beyond
that, there are two approaches we have not yet finalized:

1. Collect a K-size batch from the incoming data, learn the model with that
mini-batch, and store the model. In this case the memory requirement
depends on K and the number of features per event. This way we can also
achieve streaming properties such as data obsolescence and data horizon:
relevant data is kept while irrelevant data is removed from model training.
2. Collect the data into a large in-memory buffer and use a moving window
of size K with shift n, where n is the number of points the window is moved
after each calculation. In this case we need a large amount of memory.
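The two batching strategies above can be sketched in a library-agnostic
way; `k` and `n` are the parameters from the text, and everything else
(function names, the toy integer events) is illustrative:

```python
from collections import deque

def k_size_batches(events, k):
    """Approach 1: emit disjoint mini-batches of size k, then discard them."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == k:
            yield list(batch)
            batch.clear()  # old data drops out: obsolescence / data horizon

def sliding_windows(events, k, n):
    """Approach 2: keep the last k events; emit a window every n new events."""
    window = deque(maxlen=k)  # memory always holds up to k events
    since_emit = 0
    for event in events:
        window.append(event)
        since_emit += 1
        if len(window) == k and since_emit >= n:
            yield list(window)
            since_emit = 0

batches = list(k_size_batches(range(10), k=4))       # [[0,1,2,3], [4,5,6,7]]
windows = list(sliding_windows(range(10), k=4, n=2))
```

The memory trade-off is visible in the sketch: approach 1 holds at most K
events at a time, while approach 2 must also hold K events but reprocesses
overlapping data on every shift of n.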

Another approach that was raised is to store the events/data points in a
database and use them later. As we discussed, there are two approaches to
sending the updated/learned model to the customer side: time based and size
based. That is, there can be a time window (one day, one week, etc.) or a
batch size (or both) in the K-size batch approach.
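A minimal sketch of combining the time-based and size-based triggers for
pushing an updated model; the class and its thresholds are assumptions for
illustration, not a settled design:

```python
import time

class ModelPublishTrigger:
    """Decide when to push an updated model to the customer side.

    Fires when either max_events points have arrived (size based) or
    max_seconds have elapsed since the last push (time based).
    """
    def __init__(self, max_events, max_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.max_seconds = max_seconds
        self.clock = clock
        self.count = 0
        self.last_push = clock()

    def record(self, n=1):
        """Register n new events; return True if the model should be pushed."""
        self.count += n
        if (self.count >= self.max_events
                or self.clock() - self.last_push >= self.max_seconds):
            self.count = 0
            self.last_push = self.clock()
            return True
        return False
```

With `max_seconds` set to a day or a week this gives the time window from
the text, and `max_events` gives the K-size batch trigger; using both
covers slow streams (time fires) and bursty streams (size fires).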

The other component is the wrapper around the Spark MLlib SGD-based
algorithms for incremental learning. As I realized, there will be memory
and other constraints when we incrementally learn models from the stream
data coming out of CEP, basically on the machine where CEP is deployed.
Therefore we need to look into timing and performance when we run those
algorithms on large data sets frequently over time. Initially, we plan to
develop the extension for CEP/Siddhi to get the stream data/events/sample
points; after that we can move on to the MLlib libraries. We now have three
algorithms, linear regression, k-means clustering, and logistic regression,
though we initially look only at the first two. So this week will be spent
developing that extension. Thank you.
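The incremental-learning idea behind the wrapper, starting each mini-batch's
SGD from the previously saved weights instead of from scratch, can be
sketched without any Spark dependency (the functions below are
illustrative; AFAIK MLlib's SGD trainers accept initial weights for exactly
this kind of warm start, but check the API before relying on it):

```python
def sgd_step(w, b, batch, lr=0.05, epochs=20):
    """One incremental update: continue SGD from the previous (w, b)."""
    for _ in range(epochs):
        for x, y in batch:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(w, b, data):
    """Mean squared error of the current model, reported back to CEP."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Warm-start across K-size mini-batches: model state carries over, so no
# batch needs to be kept in memory after its update is applied.
w, b = 0.0, 0.0
stream = [(x / 10.0, 3.0 * (x / 10.0) - 1.0) for x in range(-20, 20)]
for i in range(0, len(stream), 10):          # K = 10
    w, b = sgd_step(w, b, stream[i:i + 10])
```

The per-batch cost here is O(K * epochs * features), which is the quantity
the timing evaluation above needs to pin down before K is fixed.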

regards,
Mahesh.

[1]
https://docs.wso2.com/display/CEP400/Writing+a+Custom+Stream+Processor+Extension