Hi Maheshakya, this is the one I need to post. Please check.

Today we had an initial project meeting with the WSO2 ML team, where we discussed the architecture, the best approaches, and the scope of the entire project. As I understood it, there will be three main components in the design:
1. A CEP Siddhi extension, serving as the streaming data interface to the CEP
2. A core holding the sample data points and the most recently saved model
3. A wrapper around the Apache Spark MLlib algorithms
Since we need to support streaming data from CEP Siddhi, what I initially need to do is develop a Siddhi extension that feeds event streams into my program. As we discussed, the best approach is a CEP Siddhi extension built on the Stream Processor extension point, so that CEP event streams can be passed into the program to incrementally learn a model for prediction and analysis [1]. The extension will be the interface through which the CEP sends data to the program, and there can be different interfaces for different applications. The stream data from the CEP is taken as batches rather than as single events. There will also be output from my program back to the CEP: the recent model information and parameters such as the MSE, cluster centers, etc.

After that, there are two approaches that we have not yet finalized:
1. Collect a K-size batch from the incoming data, learn the model with that mini batch, and store the model. In this case the memory requirement depends on K and on the number of features per event. This way we can also address streaming-specific concerns such as data obsolescence and data horizon, keeping relevant data in the model training while removing irrelevant data.
2. Collect the data into a large in-memory buffer and use a moving window of size K with shift n, where n is the number of points the window is moved after each calculation. In this case we need a large amount of memory.

Another approach that was raised is to store the events/data points in a database and use them later. As we discussed, there are two approaches for sending the updated/learned model to the customer side: time based and size based. That is, in the K-size batch approach there can be a time window (one day, one week, etc.) or a batch size threshold, or both. The other component is the wrapper around the Spark MLlib SGD-based algorithms for incremental learning.
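To make the second approach concrete, here is a minimal sketch in plain Java of a moving window of size K with shift n over incoming events. The class and method names are my own, hypothetical ones, not part of the Siddhi or MLlib APIs; the real extension would receive events through Siddhi's StreamProcessor callbacks instead of a plain `add` method.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of approach 2: keep a window of the last K events and,
// every n new events, hand a snapshot of the window to the learner.
public class MovingWindow {
    private final int k;          // window size K
    private final int shift;      // shift n between trainings
    private final Deque<double[]> window = new ArrayDeque<>();
    private int sinceLastTrain = 0;

    public MovingWindow(int k, int shift) {
        this.k = k;
        this.shift = shift;
    }

    // Returns the current window when it is time to (re)train, else null.
    public List<double[]> add(double[] event) {
        window.addLast(event);
        if (window.size() > k) {
            window.removeFirst();   // drop obsolete points (data horizon)
        }
        sinceLastTrain++;
        if (window.size() == k && sinceLastTrain >= shift) {
            sinceLastTrain = 0;
            return new ArrayList<>(window);  // snapshot for the learner
        }
        return null;
    }
}
```

With K = 4 and n = 2, the first training batch is emitted on the 4th event and a new batch every 2 events after that, which shows where the memory cost comes from: the full K-size window must be held between trainings.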
As I realized, there will be memory constraints and other constraints when we incrementally learn models from the stream data coming out of the CEP, basically on the machine where the CEP is deployed. Therefore we need to look into timing and performance when running those algorithms on large data sets frequently over time. Initially we are supposed to develop the extension for CEP/Siddhi to receive the stream data/events/sample points; after that we can move on to the MLlib algorithms. For now we have three algorithms, linear regression, k-means clustering, and logistic regression, though initially we will look only at the first two. So this week will be spent developing that extension.

Thank you.
Regards,
Mahesh.

[1] https://docs.wso2.com/display/CEP400/Writing+a+Custom+Stream+Processor+Extension
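As a rough illustration of what the MLlib wrapper needs to do per mini batch, here is a minimal, dependency-free Java sketch of incremental learning for linear regression: the weights are kept between batches and refined with one SGD step per K-size batch, and the MSE is computed as one of the parameters reported back to the CEP. All class and method names here are my own; the actual wrapper would delegate the gradient computation to Spark MLlib's SGD-based implementations rather than this single-machine loop.

```java
// Sketch of incremental learning on mini batches: model weights persist
// across batches and are refined by one SGD step per batch.
public class IncrementalLinearModel {
    private final double[] weights;
    private final double learningRate;

    public IncrementalLinearModel(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    public double predict(double[] x) {
        double y = 0.0;
        for (int j = 0; j < weights.length; j++) y += weights[j] * x[j];
        return y;
    }

    // One gradient step over a K-size mini batch (features, labels).
    public void update(double[][] batchX, double[] batchY) {
        double[] grad = new double[weights.length];
        for (int i = 0; i < batchX.length; i++) {
            double err = predict(batchX[i]) - batchY[i];  // residual
            for (int j = 0; j < weights.length; j++) {
                grad[j] += err * batchX[i][j];
            }
        }
        for (int j = 0; j < weights.length; j++) {
            weights[j] -= learningRate * grad[j] / batchX.length;
        }
    }

    // Mean squared error: one of the model parameters sent back to the CEP.
    public double mse(double[][] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            double e = predict(x[i]) - y[i];
            s += e * e;
        }
        return s / x.length;
    }
}
```

Feeding repeated mini batches of (x, 2x) pairs drives the single weight toward 2, which is the essential point of the design: the model improves batch by batch without ever holding the full data set in memory.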
_______________________________________________ Dev mailing list [email protected] http://wso2.org/cgi-bin/mailman/listinfo/dev
