Hi Maheshakya,
This is the one I need to post. Please check.

Today we had an initial project meeting with the WSO2 ML team, where we discussed the
architecture, best approaches, and scope of the entire project. As I
understood it, there will be three main components in the design:
1. CEP Siddhi extension for the CEP streaming data interface
2. Core with sample data points and the most recently saved model
3. Apache Spark MLlib algorithm wrapper
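To make the division of responsibilities concrete, here is a minimal sketch of how the three components above could fit together. All names here are illustrative assumptions, not the real Siddhi or MLlib APIs:

```java
/** Hypothetical sketch of the three-component design (illustrative names only). */
public class Design {

    /** 3. Wrapper around an MLlib-style, SGD-based algorithm. */
    public interface AlgorithmWrapper {
        void learn(double[][] samples);
        double[] modelParameters();      // e.g. weights, MSE, cluster centers
    }

    /** 2. Core: holds the most recently learned/saved model. */
    public static class Core {
        private final AlgorithmWrapper algorithm;
        private double[] savedModel = new double[0];

        public Core(AlgorithmWrapper algorithm) {
            this.algorithm = algorithm;
        }

        /** 1. Entry point the CEP Siddhi extension would call with a batch of events. */
        public double[] onEventBatch(double[][] samples) {
            algorithm.learn(samples);
            savedModel = algorithm.modelParameters();
            return savedModel;           // sent back to CEP as output events
        }
    }
}
```

The point of the sketch is only that the extension is a thin interface: it hands batches to the core, which delegates learning to the wrapper and returns model parameters.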

Since we need to facilitate streaming data support for CEP Siddhi, what I
need initially is to develop a Siddhi extension to get event streams into
my program. As we discussed, the best approach is a CEP Siddhi Stream
Processor extension [1], which brings CEP event streams into the program so
it can incrementally learn the model for predictions and analysis. The
extension will be the interface through which CEP sends data to the
program; there can be different interfaces for different applications.
Stream data from CEP is taken as batches rather than as single events.
There will also be output from my program back to CEP: the recent model
information and parameters such as MSE, cluster centers, etc. After that,
there are two approaches that we have not finalized:

1. Collect a K-size batch from the incoming data, learn the model with that
mini-batch, and store the model. In this case the memory requirement
depends on K and the number of features per event. This way we can achieve
streaming properties such as data obsolescence and data horizon, keeping
relevant data in the model training while removing irrelevant data.
2. Collect data into a large in-memory buffer and use a moving window of
size K with shift n, where n is the number of points the window is moved
after each calculation. In this case we need a large amount of memory.
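As a rough illustration of the two options, here is a minimal sketch of both batching strategies. The class and method names are my own assumptions for illustration, not anything from Siddhi or MLlib:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the two batching strategies under discussion. */
public class Batching {

    /** Approach 1: collect K events, emit the mini-batch, then start fresh. */
    public static class MiniBatchCollector {
        private final int k;
        private final List<double[]> buffer = new ArrayList<>();

        public MiniBatchCollector(int k) { this.k = k; }

        /** Returns the full K-size batch when ready, otherwise null. */
        public List<double[]> add(double[] event) {
            buffer.add(event);
            if (buffer.size() == k) {
                List<double[]> batch = new ArrayList<>(buffer);
                buffer.clear();   // old data drops out: data obsolescence
                return batch;
            }
            return null;
        }
    }

    /** Approach 2: moving window of size K, shifted by n points per step. */
    public static class SlidingWindow {
        private final int k, n;
        private int sinceLast = 0;
        private final Deque<double[]> window = new ArrayDeque<>();

        public SlidingWindow(int k, int n) { this.k = k; this.n = n; }

        /** Returns the current window each time it has advanced by n events. */
        public List<double[]> add(double[] event) {
            window.addLast(event);
            if (window.size() > k) window.removeFirst();
            if (window.size() == k && ++sinceLast >= n) {
                sinceLast = 0;
                return new ArrayList<>(window);
            }
            return null;
        }
    }
}
```

Note the trade-off in memory: the collector only ever holds K events, while the window approach additionally retains overlapping history between calculations.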

Another approach that was raised is to store the events/data points in a
database and use them later. As we discussed, there can be two approaches
to sending the updated/learned model to the customer side: time based and
size based. That is, with the K-size batch approach there can be a time
window (one day, one week, etc.) or a batch-count threshold, or both.

Then there is the other component: the wrapper around the Spark MLlib
SGD-based algorithms for incremental learning. As I realized, there will be
memory and other constraints when we incrementally learn models with stream
data coming out of CEP, basically on the machine where CEP is deployed.
Therefore we need to look into timing and performance when using those
algorithms on large data sets frequently over time. Initially, what we are
supposed to do is develop the extension for CEP/Siddhi to get stream
data/events/sample points; after that we can move on to the MLlib
algorithms. For now we have three algorithms, linear regression, k-means
clustering, and logistic regression, though initially we will look only
into the first two. So this week will be spent developing that extension.
Thank you.
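To show the kind of per-batch update the wrapper would trigger, here is a plain-Java sketch of one SGD pass for linear regression. This is an illustration of the technique only, not the MLlib API; the class and method names are mine:

```java
/** Illustrative sketch (not the MLlib API): SGD updates for linear regression,
 *  applied one mini-batch at a time as batches arrive from CEP. */
public class SgdLinearRegression {
    private final double[] weights;
    private final double learningRate;

    public SgdLinearRegression(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Update weights from one mini-batch of (features, label) samples. */
    public void learnBatch(double[][] x, double[] y) {
        for (int i = 0; i < x.length; i++) {
            double error = predict(x[i]) - y[i];
            for (int j = 0; j < weights.length; j++) {
                weights[j] -= learningRate * error * x[i][j];
            }
        }
    }

    public double predict(double[] features) {
        double sum = 0.0;
        for (int j = 0; j < features.length; j++) {
            sum += weights[j] * features[j];
        }
        return sum;
    }

    /** Mean squared error over a batch: one of the parameters sent back to CEP. */
    public double mse(double[][] x, double[] y) {
        double total = 0.0;
        for (int i = 0; i < x.length; i++) {
            double e = predict(x[i]) - y[i];
            total += e * e;
        }
        return total / x.length;
    }
}
```

Because each `learnBatch` call only touches the current mini-batch, memory use stays bounded by K and the feature count, which is exactly the constraint noted above for the machine where CEP is deployed.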

regards,
Mahesh.

[1]
https://docs.wso2.com/display/CEP400/Writing+a+Custom+Stream+Processor+Extension
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
