Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Caleb Welton Fri, 15 Jan 2016 14:53:07 -0800

Sorry I missed the community call this morning.  I heard that this was
discussed in more detail, but haven't seen the minutes of the call posted
yet.  Here are a couple more thoughts on this:


The matrix operation based implementation offered by Guatam is intuitive
and logical way of describing the algorithm, if we had an efficient way of
expressing algorithms like this it would greatly simply the process of
adding new algorithms and lower the barrier to entry for contributions to
the project.  Which would be a good thing, so I wanted to spend a bit more
thought on what this would take and why this solution is not efficient
today.

Primarily the existing implementation we have for calculating X_T_X in
MADlib is singnificantly more efficient than the implementation within
madlib.matrix_mult(), but the implementation in madlib.matrix_mult() is
much more general purpose.  The existing implementation is hard coded to
handle the fact that both X and t(X) are operating on the same matrix and
that this specific calculation is such that each row of the matrix becomes
the column in the transpose that it is multiplied with meaning that if we
have all the data for the row then the contributions from that row can be
calculated without any additional redistribution of data.  Further since
they are the same table we don't have to join the two tables together to
get that data and we can complete the entire operation with a single scan
of one table.  We do not seem to have the optimization for this very
special case enabled in madlib.matrix_mult() resulting in the
implementation of the multiplication being substantially slower.

Similarly for X_T_Y in our typical cases X and Y are both in the same
initial input table and in some ways we can think of "XY" as a single
matrix that we have simply sliced vertically to produce X and Y as separate
matrices, this means that despite X and Y being different matrices from the
mathematical expression of the model we can still use the same in-place
logic that we used for X_T_X.  As expressed in the current
madlib.matrix_mult() api there is no easy way for matrix_mult to recognize
this relationship and so we end up forced to go the inefficient route even
if we added the special case optimization when the left and right sides of
the multiplication are transpositions of the same matrix.

One path forward that would help make this type of implementation viable
would be by adding some of these optimizations and possible api
enhancements into matrix_mult code so that we can get the implementation
more efficient going this route we could probably get from 30X perfomance
hit down to only 2X performance hit - based on having to make separate
scans for X_T_X and X_T_Y rather than being able to combine both
calculations in a single scan of the data.  Reducing that last 2X would
take more effort and a greater level of sophistication in our optimization
routines.  The general case would likely require some amount of code
generation.

Regards,
  Caleb

On Thu, Jan 14, 2016 at 5:32 PM, Caleb Welton <[email protected]> wrote:

> Great seeing the prototype work here, I'm sure that there is something
> that we can find from this work that we can bring into MADlib.
>
> However... It is a very different implementation from the existing
> algorithms, calling into the madlib matrix functions directly rather than
> having the majority of the work done within the abstraction layer.
> Unfortunately this leads to a very inefficient implementation.
>
> As demonstration of this I ran this test case:
>
> Dataset: 1 dependent variable, 4 independent variables + intercept,
> 10,000,00 observations
>
> Run using Postgres 9.4 on a Macbook Pro:
>
> Creating the X matrix from source table: 13.9s
> Creating the Y matrix from source table: 9.1s
> Computing X_T_X via matrix_mult: 169.2s
> Computing X_T_Y via matrix_mult: 114.8s
>
> Calling madlib.linregr_train directly (implicitly calculates all of the
> above as well as inverting the X_T_X matrix and calculating some other
> statistics): 10.3s
>
> So in total about 30X slower than our existing methodology for doing the
> same calculations.  I would expect this delta to potentially get even
> larger if it was to move from Postgres to Greenplum or HAWQ where we would
> be able to start applying parallelism.  (the specialized XtX multiplication
> in linregr parallelizes perfectly, but the more general matrix_mult
> functionality may not)
>
> As performance has been a key aspect to our development I'm not sure that
> we want to architecturally go down the path outlined in this example code.
>
> That said... I can certainly see how this layer of abstraction could be a
> valuable way of expressing things from a development perspective so the
> question for the development community is if there is a way that we can
> enable people to write code more similar to what Guatam has expressed while
> preserving the performance of our existing implementations?
>
> The ideas that come to mind would be to take an API abstraction approach
> more akin to what we can see in Theano where we can express a series of
> matrix transformations abstractly and then let the framework work out the
> best way to calculate the pipeline?  Large project to do that... but it
> could one answer to the long held question "how should we define our python
> abstraction layer?".
>
> As a whole I'd be pretty resistant to adding dependencies on numpy/scipy
> unless there was a compelling use case where the performance overhead of
> implementing the MATH (instead of the control flow) in python was not
> unacceptably large.
>
> -Caleb
>
> On Thu, Dec 24, 2015 at 12:51 PM, Frank McQuillan <[email protected]>
> wrote:
>
>> Gautam,
>>
>> Thank you for working on this, it can be a great addition to MADlib.  Cpl
>> comments below:
>>
>> 0) Dependencies on numpy and scipy.  Currently the platforms PostgreSQL,
>> GPDB and HAWQ do not ship with numpy or scipy by default, so we may need
>> to
>> look at this dependency more closely.
>>
>> 2a,b) The following creation methods exist will exist MADlib 1.9.  They
>> are
>> already in the MADlib code base:
>>
>> -- Create a matrix initialized with ones of given row and column dimension
>>   matrix_ones( row_dim, col_dim, matrix_out, out_args)
>>
>> -- Create a matrix initialized with zeros of given row and column
>> dimension
>>   matrix_zeros( row_dim, col_dim, matrix_out, out_args)
>>
>> -- Create an square identity matrix of size dim x dim
>>   matrix_identity( dim, matrix_out, out_args)
>>
>> -- Create a diag matrix initialized with given diagonal elements
>>   matrix_diag( diag_elements, matrix_out, out_args)
>>
>> 2c) As for “Sampling matrices and scalars from certain distributions. We
>> could start with Gaussian (multi-variate), truncated normal, Wishart,
>> Inverse-Wishart, Gamma, and Beta.”  I created a JIRA for that here:
>> https://issues.apache.org/jira/browse/MADLIB-940
>> I agree with your recommendation.
>>
>> 3) Pipelining
>> * it’s an architecture question that I agree we need to address, to reduce
>> disk I/O between steps
>> * Could be a platform implementation, or we can think about if MADlib can
>> do something on top of the existing platform by coming up with a way to
>> chain operations in-memory
>>
>> 4) I would *strongly* encourage you to go the next/last mile and get this
>> into MADlib.  The community can help you do it.  And as you say we need to
>> figure out how/if to support numpy and scipy, or do MADlib functions via
>> Eigen or Boost to handle alternatively.
>>
>> Frank
>>
>> On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar <
>> [email protected]> wrote:
>>
>> > > Hi Team MADlib,
>> > >
>> > > I managed to complete the implementation of the Bayesian analysis of
>> the
>> > binary Probit regression model on MPP. The code has been tested on the
>> > greenplum sandbox VM and seems to work fine. You can find the code here:
>> > >
>> > >
>> >
>> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
>> > >
>> > > In the git repo, probit_regression.ipynb is the stand alone python
>> > implementation. To verify correctness, I compared against R's MCMCpack
>> > library that can also be run in the Jupyter notebook!
>> > >
>> > > probit_regression_mpp.ipynb is the distributed implementation for
>> > Greenplum or HAWQ. This uses the MADlib matrix operations heavily. In
>> > addition, it also has dependencies on numpy and scipy. If you look at
>> the
>> > Gibbs Probit Driver function, you will see that the only operations in
>> > memory are those that involve inverting a matrix (in this case, the
>> > covariance matrix or the X_T_X matrix, whose dimensions equal the
>> number of
>> > features and hence, hopefully reasonable), sampling from a multivariate
>> > normal, and handling the coefficients.
>> > >
>> > > A couple of observations based on my experience with the MADlib matrix
>> > operations:
>> > >
>> > > 1. First of all, they are a real boon! Last year, we implemented the
>> > auto encoder in MPP and we had to write our own matrix operations, which
>> > was painful. So kudos to you guys! The Matrix operations meant that it
>> took
>> > me ~ 4 hours to complete the implementation in MPP. That is significant,
>> > albeit I have experience with SQL and PL/Python.
>> > >
>> > > 2. It would be great if we can get the following matrix functionality
>> in
>> > MADlib at some point:
>> > >     a. Creating an identity matrix
>> > >     b. Creating a zero matrix
>> > >     c. Sampling matrices and scalars from certain distributions. We
>> > could start with Gaussian (multi-variate), truncated normal, Wishart,
>> > Inverse-Wishart, Gamma, and Beta.
>> > >
>> > > 3. I still do think that as a developer using MADlib matrix
>> operations,
>> > we need to write a lot of code, mainly due to the fact that we need to
>> > create SQL tables in a pipeline. We should probably look to reduce this
>> and
>> > see if we can efficiently pipeline operations.
>> > >
>> > > 4. Lastly, I would like to see if this can end up in MADlib at some
>> > point. But to end up in MADlib, we will need to implement the truncated
>> > normal and multi-variate normal samplers. If we can perhaps carve out a
>> > numpy and scipy dependent section in MADlib and make it clear that these
>> > functions work only if numpy and scipy are installed, then that might
>> > accelerate MADlib contributions from committers.
>> >
>> > Sent from my iPhone
>>
>
>

Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Reply via email to