Hi MADlib Developers,

I have implemented Steffen Rendle's Factorization Machines <http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf> in Python and on MPP. You can find the pure Python implementation and its MPP variant for Greenplum and HAWQ here:
1. Factorization Machines - Pure Python <https://github.com/gautamsm/data-science-on-mpp/blob/master/ml_algorithms/factorization_machines.ipynb>
2. Factorization Machines - MPP <https://github.com/gautamsm/data-science-on-mpp/blob/master/ml_algorithms/factorization_machines_mpp.ipynb>

*A quick summary of factorization machines:*

Typically, when building a recommender system, you not only have tuple interactions (user-item, user-offer, user-TV show, user-purchase, etc.), but also side information in the form of user and item features. Historically, algorithms such as SVMs and dense kernel methods have not been good at estimating parameters for recommenders owing to the sparsity of the tuple interactions. Bayesian collaborative filtering approaches have had success, but incorporating side information entails fairly complex hierarchical modeling, with MCMC or Gibbs sampling required for inference of the model parameters. Rendle's work is an extremely interesting and novel way of thinking about this problem that seamlessly blends tuple interactions with side information. Since the tuple interactions are modeled via a low-rank matrix decomposition, the quality of parameter estimation does not suffer from sparsity the way it does for SVMs. I found the approach really elegant, and further reading plus a few great posts on Quora (one by a leading name in the field, Xavier Amatriain <https://www.quora.com/What-are-some-of-the-best-recommendation-algorithms>) convinced me of the merit of this work.

A couple of key points about factorization machines:

1. They can be seen as a generalization of SVMs and SVD++.
2. *They can be applied to any real-valued feature vector for regression or classification problems (think in place of linear/logistic regression, SVMs, etc.).*

*A quick note on my MPP implementation:*

Function interface:

fm_fit (
    src_table    varchar,
    id_col_name  varchar,
    x_col_name   varchar,
    y_col_name   varchar,
    parameters   varchar,
    out_table    varchar
)

Parameters of the fm_fit function:

src_table   - table containing the feature vectors and the dependent
              variable as two separate columns
id_col_name - an id column for each row
x_col_name  - the feature vector column
y_col_name  - the dependent variable column
parameters  - learning parameters: regularizers (lambda0, lambda1,
              lambda2), number of iterations (n_iter), learning rate
              (learning_rate), factorization hyper-parameter (k), random
              seed (an integer, for batch-sgd), and optim (batch or
              batch-sgd)
out_table   - name of the table where the model will be stored

1. The MPP implementation currently covers the regression problem (squared loss). I plan to add the classification problem (hinge/log loss) soon.
2. The implementation assumes your training data is randomly distributed. Two gradient descent variants are provided: mini-batch SGD and regular batch gradient descent. The mini-batch SGD implementation is very closely based on this paper by Zinkevich et al. <http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf>; a sketch of the idea follows below.
3. Lastly, the time complexity of factorization machines is roughly O(TKN), where T is the number of SGD iterations, K is the number of latent factors, and N is the feature dimension. The implementation involves a couple of loops for estimating the parameters of the tuple interactions, which I couldn't vectorize (I could have, but the vectorization required building 3-D matrices, which was even slower). Consequently, an implementation of the SGD in C or C++ (as in MADlib) will be much faster than the PL/Python approach.
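For anyone new to the model, it may help to write out the degree-2 factorization machine from Rendle's paper (the equations below are his, in standard notation):

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j

The pairwise term looks like it costs O(KN^2), but Rendle shows it can be rewritten so that a prediction costs only O(KN):

\sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j = \frac{1}{2} \sum_{f=1}^{K} \left[ \Big( \sum_{i=1}^{N} v_{i,f} \, x_i \Big)^2 - \sum_{i=1}^{N} v_{i,f}^2 \, x_i^2 \right]

which is where the O(TKN) total in point 3 comes from.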
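Here is a minimal NumPy sketch of that prediction step, just to make the reformulation concrete (the variable names are mine, and this is illustrative rather than the exact code in the notebooks linked above):

import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 FM prediction in O(K*N) via the reformulation above.

    x  : (N,) feature vector
    w0 : scalar global bias
    w  : (N,) linear weights
    V  : (N, K) latent factor matrix; row i is v_i
    """
    vx = V.T.dot(x)              # per-factor sums: sum_i v_{i,f} x_i
    v2x2 = (V * V).T.dot(x * x)  # sum_i v_{i,f}^2 x_i^2
    pairwise = 0.5 * np.sum(vx ** 2 - v2x2)
    return w0 + w.dot(x) + pairwise

The same per-factor sums vx reappear in the SGD gradients for the latent factors, which is what keeps each pass over a row at O(KN).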
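And since point 2 above leans on the Zinkevich et al. scheme, here is a rough serial simulation of the model-averaging idea (names like sgd_on_shard are hypothetical; on Greenplum/HAWQ the per-shard loop would actually run in parallel on the segments, e.g. inside a PL/Python function):

import numpy as np

def sgd_on_shard(X, y, theta0, grad, lr, n_epochs):
    """Plain SGD over one data shard with a fixed learning rate."""
    theta = theta0.copy()
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):
            theta -= lr * grad(theta, X[i], y[i])
    return theta

def simu_parallel_sgd(shards, theta0, grad, lr=0.01, n_epochs=5):
    """Run SGD independently on each shard, then average the
    resulting parameter vectors (Zinkevich et al.'s scheme)."""
    models = [sgd_on_shard(X, y, theta0, grad, lr, n_epochs)
              for X, y in shards]
    return np.mean(models, axis=0)

# Toy usage with a squared-loss linear model standing in for the FM:
# d/dtheta of (theta.x - y)^2 is 2 * (theta.x - y) * x
sq_grad = lambda theta, x, y: 2.0 * (theta.dot(x) - y) * x

The averaging step maps naturally onto an MPP database: each segment trains on its local rows, and only the N*K + N + 1 model parameters need to move over the interconnect.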
Many thanks to Xiaocheng Teng (MADlib committer) for discussions on parallel SGD algorithms.

Does the community feel this would be a good addition to MADlib?

Regards,
Gautam

--
------------------------------------------------------------------
Gautam S. Muralidhar, Ph.D.
Sr. Data Scientist, Pivotal Software, Inc.
------------------------------------------------------------------
