> Hi Team MADlib,
>
> I managed to complete the implementation of the Bayesian analysis of the
> binary probit regression model on MPP. The code has been tested on the
> Greenplum sandbox VM and seems to work fine. You can find the code here:
>
> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
>
> In the git repo, probit_regression.ipynb is the standalone Python
> implementation. To verify correctness, I compared against R's MCMCpack
> library, which can also be run in the Jupyter notebook!
>
> probit_regression_mpp.ipynb is the distributed implementation for Greenplum
> or HAWQ. This uses the MADlib matrix operations heavily. In addition, it
> has dependencies on NumPy and SciPy. If you look at the Gibbs Probit Driver
> function, you will see that the only operations in memory are those that
> involve inverting a matrix (in this case, the covariance matrix or the X_T_X
> matrix, whose dimensions equal the number of features and are hence,
> hopefully, reasonable), sampling from a multivariate normal, and handling
> the coefficients.
>
> A couple of observations based on my experience with the MADlib matrix
> operations:
>
> 1. First of all, they are a real boon! Last year, we implemented the
> autoencoder in MPP and we had to write our own matrix operations, which was
> painful. So kudos to you guys! The matrix operations meant that it took me
> ~4 hours to complete the implementation in MPP. That is significant, albeit
> I have experience with SQL and PL/Python.
>
> 2. It would be great if we could get the following matrix functionality in
> MADlib at some point:
>    a. Creating an identity matrix
>    b. Creating a zero matrix
>    c. Sampling matrices and scalars from certain distributions. We could
>       start with Gaussian (multivariate), truncated normal, Wishart,
>       Inverse-Wishart, Gamma, and Beta.
>
> 3. I still think that, as developers using the MADlib matrix operations, we
> need to write a lot of code, mainly because we need to create SQL tables in
> a pipeline. We should probably look to reduce this and see if we can
> pipeline operations efficiently.
>
> 4. Lastly, I would like to see if this can end up in MADlib at some point.
> But to end up in MADlib, we will need to implement the truncated normal and
> multivariate normal samplers. If we can perhaps carve out a NumPy- and
> SciPy-dependent section in MADlib and make it clear that these functions
> work only if NumPy and SciPy are installed, then that might accelerate
> MADlib contributions from committers.
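
For anyone following the thread without opening the notebooks, here is a rough in-memory sketch of the kind of Gibbs step described above: invert a features-by-features matrix, draw the latent variables from truncated normals, and draw the coefficients from a multivariate normal. It is not taken from the linked repo. It assumes the standard Albert-Chib data-augmentation scheme with a N(0, prior_var * I) prior on the coefficients, and the names `gibbs_probit`, `prior_var`, `n_iter`, and `burn_in` are made up for illustration. In the distributed version, products such as X'X and X'z would presumably be computed in-database rather than with NumPy.

```python
import numpy as np
from scipy.stats import truncnorm


def gibbs_probit(X, y, n_iter=2000, burn_in=500, prior_var=100.0, seed=0):
    """In-memory Gibbs sampler sketch for binary probit regression.

    Assumes a N(0, prior_var * I) prior on the coefficients (an
    illustrative choice, not necessarily the one used in the notebooks).
    Returns an array of post-burn-in coefficient draws.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Posterior covariance of beta given z. It is p x p (number of
    # features) and stays fixed, so it is inverted only once.
    xtx = X.T @ X
    cov = np.linalg.inv(xtx + np.eye(p) / prior_var)

    beta = np.zeros(p)
    draws = np.empty((n_iter - burn_in, p))
    for it in range(n_iter):
        # Step 1: latent z_i ~ N(x_i' beta, 1), truncated to z_i > 0 when
        # y_i = 1 and to z_i <= 0 when y_i = 0. truncnorm expects bounds
        # standardized as (bound - loc) / scale.
        mu = X @ beta
        lower = np.where(y == 1, -mu, -np.inf)
        upper = np.where(y == 1, np.inf, -mu)
        z = truncnorm.rvs(lower, upper, loc=mu, scale=1.0,
                          size=n, random_state=rng)

        # Step 2: beta | z ~ N(cov @ X'z, cov).
        mean = cov @ (X.T @ z)
        beta = rng.multivariate_normal(mean, cov)

        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws
```

Calling `gibbs_probit(X, y).mean(axis=0)` on a design matrix `X` and a 0/1 label vector `y` gives posterior-mean coefficient estimates, which is the sort of output one could compare against MCMCpack as mentioned in the message.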
