> Hi Team MADlib,
> 
> I have completed an implementation of Bayesian analysis for the binary 
> probit regression model on MPP. The code has been tested on the Greenplum 
> sandbox VM and appears to work correctly. You can find the code here:
> 
> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
> 
> In the git repo, probit_regression.ipynb is the standalone Python 
> implementation. To verify correctness, I compared its results against R's 
> MCMCpack library, which can also be run in a Jupyter notebook.
> 
> probit_regression_mpp.ipynb is the distributed implementation for Greenplum 
> or HAWQ. It uses the MADlib matrix operations heavily and also depends on 
> NumPy and SciPy. If you look at the Gibbs Probit Driver function, you will 
> see that the only operations performed in memory are inverting a matrix 
> (here the covariance matrix or the X_T_X matrix, whose dimensions equal the 
> number of features and should therefore be reasonably small), sampling from 
> a multivariate normal, and handling the coefficients.
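
For readers following along, the in-memory steps described above can be sketched roughly as follows. This is a minimal single-node illustration, not the code from the repo: it assumes the usual data-augmentation Gibbs sampler for probit regression with a zero-mean Gaussian prior on the coefficients, and the name gibbs_probit and the prior_precision parameter are hypothetical. In the distributed version, X.T @ X and X.T @ z would presumably be computed in the database, leaving only the small d x d pieces in memory.

```python
import numpy as np
from scipy.stats import truncnorm


def gibbs_probit(X, y, n_iter=500, burn_in=100, prior_precision=1e-2, seed=0):
    """Gibbs sampler for binary probit regression via latent variables.

    Assumes a N(0, I / prior_precision) prior on the coefficients
    (an illustrative choice, not taken from the original code).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Posterior covariance of beta given the latent z: a d x d inverse,
    # where d is the number of features.
    V = np.linalg.inv(X.T @ X + prior_precision * np.eye(d))
    beta = np.zeros(d)
    draws = np.empty((n_iter - burn_in, d))
    for it in range(n_iter):
        mu = X @ beta
        # Latent z_i ~ N(mu_i, 1), truncated to z_i > 0 when y_i = 1
        # and to z_i <= 0 when y_i = 0. Bounds are given relative to mu.
        lower = np.where(y == 1, -mu, -np.inf)
        upper = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # beta | z ~ N(V X^T z, V): the multivariate normal draw.
        beta = rng.multivariate_normal(V @ (X.T @ z), V)
        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws
```

Averaging the returned draws over rows gives posterior mean estimates that can be compared with a reference implementation.
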
> 
> A couple of observations based on my experience with the MADlib matrix 
> operations:
> 
> 1. First of all, they are a real boon! Last year, we implemented an 
> autoencoder in MPP and had to write our own matrix operations, which was 
> painful. So kudos to you guys! With the matrix operations, it took me only 
> ~4 hours to complete the implementation in MPP. That is significant, 
> although I do have experience with SQL and PL/Python.
> 
> 2. It would be great if we could get the following matrix functionality in 
> MADlib at some point:
>     a. Creating an identity matrix
>     b. Creating a zero matrix
>     c. Sampling matrices and scalars from certain distributions. We could 
> start with Gaussian (multivariate), truncated normal, Wishart, 
> Inverse-Wishart, Gamma, and Beta.
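
As a point of reference for the wish list above, these are the single-node NumPy and SciPy calls that such MADlib functions would roughly mirror. This is only a sketch of the desired functionality in memory; it says nothing about how a distributed version would be exposed, and the dimension and distribution parameters are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d = 3  # illustrative dimension

identity = np.eye(d)        # (a) identity matrix
zeros = np.zeros((d, d))    # (b) zero matrix

# (c) samplers; all parameters below are arbitrary examples
mvn = rng.multivariate_normal(np.zeros(d), identity, size=5)
tn = stats.truncnorm.rvs(0.0, np.inf, size=5, random_state=rng)  # N(0, 1) on [0, inf)
w = stats.wishart.rvs(df=d + 2, scale=identity, random_state=rng)
iw = stats.invwishart.rvs(df=d + 2, scale=identity, random_state=rng)
g = rng.gamma(shape=2.0, scale=1.0, size=5)
b = rng.beta(2.0, 5.0, size=5)
```
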
> 
> 3. I still think that developers using the MADlib matrix operations have to 
> write a lot of code, mainly because each step in a pipeline requires 
> creating a SQL table. We should probably look at reducing this and see 
> whether operations can be pipelined efficiently.
> 
> 4. Lastly, I would like to see this end up in MADlib at some point. For 
> that to happen, we will need to implement the truncated normal and 
> multivariate normal samplers. Alternatively, if we could carve out a NumPy- 
> and SciPy-dependent section in MADlib and make it clear that these functions 
> work only if NumPy and SciPy are installed, that might accelerate MADlib 
> contributions from committers.
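
To give a sense of what the two samplers mentioned in point 4 would involve without NumPy or SciPy, here is a rough sketch using only the Python standard library (3.8+). It assumes inverse-CDF sampling for the truncated normal and a Cholesky factorization for the multivariate normal; all function names are hypothetical. The inverse-CDF approach loses accuracy when the truncation interval lies far in a tail, so a production version would likely need a rejection-based method there.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inv_cdf


def sample_truncated_normal(mean, sd, lower, upper, rng=random):
    """Draw from N(mean, sd^2) restricted to [lower, upper] by inverse CDF."""
    p_lo = _STD.cdf((lower - mean) / sd) if lower > -math.inf else 0.0
    p_hi = _STD.cdf((upper - mean) / sd) if upper < math.inf else 1.0
    u = rng.uniform(p_lo, p_hi)
    # inv_cdf requires 0 < u < 1, so keep u strictly inside that interval.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    x = mean + sd * _STD.inv_cdf(u)
    # Guard against rounding pushing the draw outside the interval.
    return min(max(x, lower), upper)


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def sample_multivariate_normal(mean, cov, rng=random):
    """Draw from N(mean, cov) as mean + L e, with e standard normal."""
    l = cholesky(cov)
    e = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(l[i][k] * e[k] for k in range(i + 1))
            for i, m in enumerate(mean)]
```
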
