Gautam,

Thank you for working on this; it could be a great addition to MADlib. A couple of comments below:

0) Dependencies on numpy and scipy

Currently the platforms PostgreSQL, GPDB and HAWQ do not ship with numpy or scipy by default, so we may need to look at this dependency more closely.

2a,b) The following creation methods will exist in MADlib 1.9. They are already in the MADlib code base:

-- Create a matrix initialized with ones of given row and column dimension
matrix_ones( row_dim, col_dim, matrix_out, out_args)

-- Create a matrix initialized with zeros of given row and column dimension
matrix_zeros( row_dim, col_dim, matrix_out, out_args)

-- Create a square identity matrix of size dim x dim
matrix_identity( dim, matrix_out, out_args)

-- Create a diag matrix initialized with given diagonal elements
matrix_diag( diag_elements, matrix_out, out_args)

2c) As for “Sampling matrices and scalars from certain distributions. We could start with Gaussian (multi-variate), truncated normal, Wishart, Inverse-Wishart, Gamma, and Beta.”

I created a JIRA for that here:
https://issues.apache.org/jira/browse/MADLIB-940
I agree with your recommendation.

3) Pipelining
* It’s an architecture question that I agree we need to address, to reduce disk I/O between steps
* It could be a platform implementation, or we can think about whether MADlib can do something on top of the existing platform by coming up with a way to chain operations in-memory

4) I would *strongly* encourage you to go the next/last mile and get this into MADlib. The community can help you do it. And as you say, we need to figure out how/if to support numpy and scipy, or alternatively implement the MADlib functions via Eigen or Boost.

Frank

On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar <[email protected]> wrote:

> Hi Team MADlib,
>
> I managed to complete the implementation of the Bayesian analysis of the
> binary Probit regression model on MPP. The code has been tested on the
> Greenplum sandbox VM and seems to work fine.
> You can find the code here:
>
> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
>
> In the git repo, probit_regression.ipynb is the stand alone python
> implementation. To verify correctness, I compared against R's MCMCpack
> library that can also be run in the Jupyter notebook!
>
> probit_regression_mpp.ipynb is the distributed implementation for
> Greenplum or HAWQ. This uses the MADlib matrix operations heavily. In
> addition, it also has dependencies on numpy and scipy. If you look at the
> Gibbs Probit Driver function, you will see that the only operations in
> memory are those that involve inverting a matrix (in this case, the
> covariance matrix or the X_T_X matrix, whose dimensions equal the number of
> features and hence, hopefully reasonable), sampling from a multivariate
> normal, and handling the coefficients.
>
> A couple of observations based on my experience with the MADlib matrix
> operations:
>
> 1. First of all, they are a real boon! Last year, we implemented the
> auto encoder in MPP and we had to write our own matrix operations, which
> was painful. So kudos to you guys! The Matrix operations meant that it took
> me ~ 4 hours to complete the implementation in MPP. That is significant,
> albeit I have experience with SQL and PL/Python.
>
> 2. It would be great if we can get the following matrix functionality in
> MADlib at some point:
>     a. Creating an identity matrix
>     b. Creating a zero matrix
>     c. Sampling matrices and scalars from certain distributions. We
> could start with Gaussian (multi-variate), truncated normal, Wishart,
> Inverse-Wishart, Gamma, and Beta.
>
> 3. I still do think that as a developer using MADlib matrix operations,
> we need to write a lot of code, mainly due to the fact that we need to
> create SQL tables in a pipeline. We should probably look to reduce this and
> see if we can efficiently pipeline operations.
>
> 4. Lastly, I would like to see if this can end up in MADlib at some
> point. But to end up in MADlib, we will need to implement the truncated
> normal and multi-variate normal samplers. If we can perhaps carve out a
> numpy and scipy dependent section in MADlib and make it clear that these
> functions work only if numpy and scipy are installed, then that might
> accelerate MADlib contributions from committers.
>
> Sent from my iPhone
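
On the two samplers named in point 4 of the quoted message (truncated normal and multi-variate normal): below is a minimal sketch of how both draws might look using only the Python standard library, i.e. without numpy or scipy. It is an illustration under stated assumptions, not MADlib code and not the code in the linked notebooks. The function names (sample_truncated_normal, cholesky, sample_multivariate_normal) are hypothetical, the truncated normal uses the inverse-CDF method, and the multivariate draw assumes a small, symmetric positive-definite covariance that fits in memory.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf() and inv_cdf()


def sample_truncated_normal(mean, sd, lower, upper, rng=random):
    """Draw one value from N(mean, sd^2) restricted to [lower, upper].

    Inverse-CDF method: map the bounds to probabilities, draw a uniform
    between them, and map back. Accuracy degrades when the interval lies
    far out in a tail, where cdf() saturates at 0 or 1.
    """
    p_lo = _STD.cdf((lower - mean) / sd) if lower > -math.inf else 0.0
    p_hi = _STD.cdf((upper - mean) / sd) if upper < math.inf else 1.0
    if not p_lo < p_hi:
        raise ValueError("truncation interval has no numerical mass")
    u = p_lo + (p_hi - p_lo) * rng.random()
    # inv_cdf() requires 0 < u < 1, so keep u strictly inside that range.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    x = mean + sd * _STD.inv_cdf(u)
    # Guard against rounding pushing x just outside the bounds.
    return min(max(x, lower), upper)


def cholesky(cov):
    """Return lower-triangular L with L * L^T == cov (cov must be SPD)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = cov[i][i] - s
                if d <= 0.0:
                    raise ValueError("covariance is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_multivariate_normal(mean, cov, rng=random):
    """Draw one vector from N(mean, cov) as mean + L * z, z ~ N(0, I)."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    # A standard normal restricted to the positive half-line.
    print(sample_truncated_normal(0.0, 1.0, 0.0, math.inf, rng))
    # A 2-dimensional draw with correlated components.
    print(sample_multivariate_normal([1.0, -1.0], [[4.0, 2.0], [2.0, 3.0]], rng))
```

Passing a random.Random instance as rng makes the draws reproducible. Whether a pure-Python path like this is fast enough inside PL/Python, or whether Eigen/Boost would be the better home, is an open question this sketch does not answer.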
