Great to see the prototype working here. I'm sure there is something from
this work that we can bring into MADlib.

However... it is a very different implementation from the existing
algorithms, calling into the MADlib matrix functions directly rather than
having the majority of the work done within the abstraction layer.
Unfortunately, this leads to a very inefficient implementation.

As a demonstration of this, I ran the following test case:

Dataset: 1 dependent variable, 4 independent variables + intercept,
10,000,000 observations

Run using Postgres 9.4 on a MacBook Pro:

Creating the X matrix from source table: 13.9s
Creating the Y matrix from source table: 9.1s
Computing X_T_X via matrix_mult: 169.2s
Computing X_T_Y via matrix_mult: 114.8s
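
For reference, the two matrix_mult steps above were calls roughly like
the following (table and column names are illustrative; per the matrix
API, trans=true transposes the first operand):

  -- X_T_X: transpose X and multiply by X
  SELECT madlib.matrix_mult('x_mat', 'row=row_id, val=row_vec, trans=true',
                            'x_mat', 'row=row_id, val=row_vec',
                            'x_t_x');
  -- X_T_Y: transpose X and multiply by Y
  SELECT madlib.matrix_mult('x_mat', 'row=row_id, val=row_vec, trans=true',
                            'y_mat', 'row=row_id, val=row_vec',
                            'x_t_y');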

Calling madlib.linregr_train directly (which implicitly calculates all of
the above, as well as inverting the X_T_X matrix and computing some other
statistics): 10.3s
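
For reference, that path is a single call (table and column names are
illustrative):

  SELECT madlib.linregr_train('source_table', 'linregr_out',
                              'y', 'ARRAY[1, x1, x2, x3, x4]');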

So in total (13.9 + 9.1 + 169.2 + 114.8 = 307.0s vs. 10.3s) this is about
30X slower than our existing methodology for doing the same calculations.
I would expect this delta to potentially get even larger if this were
moved from Postgres to Greenplum or HAWQ, where we would be able to start
applying parallelism. (The specialized XtX multiplication in linregr
parallelizes perfectly, but the more general matrix_mult functionality
may not.)

As performance has been a key aspect of our development, I'm not sure
that we want to go down the architectural path outlined in this example
code.

That said... I can certainly see how this layer of abstraction could be a
valuable way of expressing things from a development perspective, so the
question for the development community is whether there is a way we can
enable people to write code more like what Gautam has expressed while
preserving the performance of our existing implementations.

The idea that comes to mind would be to take an API abstraction approach
more akin to what we see in Theano, where we can express a series of
matrix transformations abstractly and then let the framework work out the
best way to calculate the pipeline. That would be a large project... but
it could be one answer to the long-held question "how should we define
our Python abstraction layer?".
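
As a purely hypothetical sketch of what that could look like
(madlib.matrix_eval does not exist; the function name and expression
syntax are invented here for illustration):

  -- Hypothetical: declare the whole computation as one expression and
  -- let the framework choose the physical plan, e.g. a fused one-pass
  -- transpose-multiply instead of materializing X_T_X via matrix_mult.
  SELECT madlib.matrix_eval(
      'solve(t(X) * X) * (t(X) * y)',   -- abstract pipeline
      'X=x_mat, y=y_mat',               -- matrix table bindings
      'coef_out');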

On the whole, I'd be pretty resistant to adding dependencies on
numpy/scipy unless there was a compelling use case where the performance
overhead of implementing the MATH (instead of just the control flow) in
Python was acceptable.

-Caleb

On Thu, Dec 24, 2015 at 12:51 PM, Frank McQuillan <[email protected]>
wrote:

> Gautam,
>
> Thank you for working on this, it could be a great addition to MADlib.  A
> couple of comments below:
>
> 0) Dependencies on numpy and scipy.  Currently the platforms PostgreSQL,
> GPDB and HAWQ do not ship with numpy or scipy by default, so we may need to
> look at this dependency more closely.
>
> 2a,b) The following creation methods will exist in MADlib 1.9.  They are
> already in the MADlib code base:
>
> -- Create a matrix initialized with ones of given row and column dimension
>   matrix_ones( row_dim, col_dim, matrix_out, out_args)
>
> -- Create a matrix initialized with zeros of given row and column dimension
>   matrix_zeros( row_dim, col_dim, matrix_out, out_args)
>
> -- Create a square identity matrix of size dim x dim
>   matrix_identity( dim, matrix_out, out_args)
>
> -- Create a diagonal matrix initialized with the given diagonal elements
>   matrix_diag( diag_elements, matrix_out, out_args)
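>
> For example, usage would look roughly like this (output table names and
> out_args are illustrative):
>
>   SELECT madlib.matrix_identity(4, 'mat_i',
>                                 'row=row_id, col=col_id, val=val');
>   SELECT madlib.matrix_diag(ARRAY[9, 6, 3, 10], 'mat_d',
>                             'row=row_id, col=col_id, val=val');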
>
> 2c) As for “Sampling matrices and scalars from certain distributions. We
> could start with Gaussian (multi-variate), truncated normal, Wishart,
> Inverse-Wishart, Gamma, and Beta.”  I created a JIRA for that here:
> https://issues.apache.org/jira/browse/MADLIB-940
> I agree with your recommendation.
>
> 3) Pipelining
> * It’s an architecture question that I agree we need to address, to
> reduce disk I/O between steps
> * It could be a platform implementation, or we can think about whether
> MADlib can do something on top of the existing platform by coming up
> with a way to chain operations in memory
>
> 4) I would *strongly* encourage you to go the next/last mile and get this
> into MADlib.  The community can help you do it.  And as you say, we need
> to figure out how/whether to support numpy and scipy, or alternatively
> implement these functions in MADlib via Eigen or Boost.
>
> Frank
>
> On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar <
> [email protected]> wrote:
>
> > Hi Team MADlib,
> >
> > I managed to complete the implementation of the Bayesian analysis of
> > the binary Probit regression model on MPP. The code has been tested on
> > the Greenplum sandbox VM and seems to work fine. You can find the code
> > here:
> >
> > https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
> >
> > In the git repo, probit_regression.ipynb is the stand-alone Python
> > implementation. To verify correctness, I compared against R's MCMCpack
> > library, which can also be run in the Jupyter notebook!
> >
> > probit_regression_mpp.ipynb is the distributed implementation for
> > Greenplum or HAWQ. This uses the MADlib matrix operations heavily. In
> > addition, it also has dependencies on numpy and scipy. If you look at
> > the Gibbs Probit Driver function, you will see that the only operations
> > in memory are those that involve inverting a matrix (in this case, the
> > covariance matrix or the X_T_X matrix, whose dimensions equal the
> > number of features and are hence hopefully reasonable), sampling from a
> > multivariate normal, and handling the coefficients.
> >
> > A couple of observations based on my experience with the MADlib matrix
> > operations:
> >
> > 1. First of all, they are a real boon! Last year, we implemented the
> > autoencoder in MPP and we had to write our own matrix operations, which
> > was painful. So kudos to you guys! The matrix operations meant that it
> > took me ~4 hours to complete the implementation in MPP. That is
> > significant, although I do have experience with SQL and PL/Python.
> >
> > 2. It would be great if we could get the following matrix functionality
> > in MADlib at some point:
> >     a. Creating an identity matrix
> >     b. Creating a zero matrix
> >     c. Sampling matrices and scalars from certain distributions. We
> > could start with Gaussian (multi-variate), truncated normal, Wishart,
> > Inverse-Wishart, Gamma, and Beta.
> >
> > 3. I still think that, as developers using the MADlib matrix
> > operations, we need to write a lot of code, mainly because we need to
> > create SQL tables in a pipeline. We should probably look to reduce this
> > and see if we can efficiently pipeline operations.
> >
> > 4. Lastly, I would like to see if this can end up in MADlib at some
> > point. But to end up in MADlib, we will need to implement the truncated
> > normal and multivariate normal samplers. If we can perhaps carve out a
> > numpy- and scipy-dependent section in MADlib and make it clear that
> > these functions work only if numpy and scipy are installed, then that
> > might accelerate MADlib contributions from committers.
> >
> > Sent from my iPhone
>
