Hi,

My name is Aditya Nain, and I am a graduate student at the University of Florida. I have been learning MADlib for a while and would like to contribute to it. I went through some of the open stories in JIRA and started working on MADLIB-410:
https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
which is about implementing a Gaussian Mixture Model using the Expectation Maximization (EM) algorithm.

While searching for a distributed EM algorithm that could be implemented in MADlib, I came across the following paper:

Carlos Ordonez, Paul Cereghini, "SQLEM: fast clustering in SQL using the EM algorithm", ACM SIGMOD Record, Volume 29, Issue 2, June 2000, Pages 559-570.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564

I thought of implementing the approach discussed in the paper, but the paper assumes that the covariance matrix is the same for all clusters (i.e., all the Gaussian distributions share a single covariance matrix). So I would like to know the community's opinion on whether it is fine to go with this assumption and implement it in MADlib.

Also, MADlib currently doesn't have an implementation of a perceptron, and I did not find any open story related to it in JIRA. I came across the following paper, which describes a distributed algorithm for the perceptron:

Ryan McDonald, Keith Hall, Gideon Mann, "Distributed training strategies for the structured perceptron"
http://dl.acm.org/citation.cfm?id=1858068

Would it be useful to have a distributed implementation of the perceptron in MADlib?

Thanks,
Aditya