Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Rahul Iyer Fri, 15 Jan 2016 15:28:18 -0800

Thanks for your comments, Caleb.

@Gautam: as I mentioned in the community call today, we have an
aggregate function, crossprod(float8[], float8[]), that can be used to
perform the X'X and X'Y operation.
- for X'X, the row_vec column would be both vector inputs
- for X'Y, the row_vec column of X would be the first input and the Y
value as an array would be the 2nd input (crossprod needs to treat the
Y as a 1x1 vector).
You would, however, have to be careful of the X'X output - it's the
matrix flattened into an array, so you would have to reshape it.


As Caleb said, we would benefit by inspecting the distribution of the
two input matrices in matrix_mult and switch between the currently
implemented inner product and this crossprod aggregate (outer
product).

On Fri, Jan 15, 2016 at 2:52 PM, Caleb Welton <cwel...@pivotal.io> wrote:
> Sorry I missed the community call this morning.  I heard that this was
> discussed in more detail, but haven't seen the minutes of the call posted
> yet.  Here are a couple more thoughts on this:
>
> The matrix operation based implementation offered by Guatam is intuitive
> and logical way of describing the algorithm, if we had an efficient way of
> expressing algorithms like this it would greatly simply the process of
> adding new algorithms and lower the barrier to entry for contributions to
> the project.  Which would be a good thing, so I wanted to spend a bit more
> thought on what this would take and why this solution is not efficient
> today.
>
> Primarily the existing implementation we have for calculating X_T_X in
> MADlib is singnificantly more efficient than the implementation within
> madlib.matrix_mult(), but the implementation in madlib.matrix_mult() is
> much more general purpose.  The existing implementation is hard coded to
> handle the fact that both X and t(X) are operating on the same matrix and
> that this specific calculation is such that each row of the matrix becomes
> the column in the transpose that it is multiplied with meaning that if we
> have all the data for the row then the contributions from that row can be
> calculated without any additional redistribution of data.  Further since
> they are the same table we don't have to join the two tables together to
> get that data and we can complete the entire operation with a single scan
> of one table.  We do not seem to have the optimization for this very
> special case enabled in madlib.matrix_mult() resulting in the
> implementation of the multiplication being substantially slower.
>
> Similarly for X_T_Y in our typical cases X and Y are both in the same
> initial input table and in some ways we can think of "XY" as a single
> matrix that we have simply sliced vertically to produce X and Y as separate
> matrices, this means that despite X and Y being different matrices from the
> mathematical expression of the model we can still use the same in-place
> logic that we used for X_T_X.  As expressed in the current
> madlib.matrix_mult() api there is no easy way for matrix_mult to recognize
> this relationship and so we end up forced to go the inefficient route even
> if we added the special case optimization when the left and right sides of
> the multiplication are transpositions of the same matrix.
>
> One path forward that would help make this type of implementation viable
> would be by adding some of these optimizations and possible api
> enhancements into matrix_mult code so that we can get the implementation
> more efficient going this route we could probably get from 30X perfomance
> hit down to only 2X performance hit - based on having to make separate
> scans for X_T_X and X_T_Y rather than being able to combine both
> calculations in a single scan of the data.  Reducing that last 2X would
> take more effort and a greater level of sophistication in our optimization
> routines.  The general case would likely require some amount of code
> generation.
>
> Regards,
>   Caleb
>
> On Thu, Jan 14, 2016 at 5:32 PM, Caleb Welton <cwel...@pivotal.io> wrote:
>
>> Great seeing the prototype work here, I'm sure that there is something
>> that we can find from this work that we can bring into MADlib.
>>
>> However... It is a very different implementation from the existing
>> algorithms, calling into the madlib matrix functions directly rather than
>> having the majority of the work done within the abstraction layer.
>> Unfortunately this leads to a very inefficient implementation.
>>
>> As demonstration of this I ran this test case:
>>
>> Dataset: 1 dependent variable, 4 independent variables + intercept,
>> 10,000,00 observations
>>
>> Run using Postgres 9.4 on a Macbook Pro:
>>
>> Creating the X matrix from source table: 13.9s
>> Creating the Y matrix from source table: 9.1s
>> Computing X_T_X via matrix_mult: 169.2s
>> Computing X_T_Y via matrix_mult: 114.8s
>>
>> Calling madlib.linregr_train directly (implicitly calculates all of the
>> above as well as inverting the X_T_X matrix and calculating some other
>> statistics): 10.3s
>>
>> So in total about 30X slower than our existing methodology for doing the
>> same calculations.  I would expect this delta to potentially get even
>> larger if it was to move from Postgres to Greenplum or HAWQ where we would
>> be able to start applying parallelism.  (the specialized XtX multiplication
>> in linregr parallelizes perfectly, but the more general matrix_mult
>> functionality may not)
>>
>> As performance has been a key aspect to our development I'm not sure that
>> we want to architecturally go down the path outlined in this example code.
>>
>> That said... I can certainly see how this layer of abstraction could be a
>> valuable way of expressing things from a development perspective so the
>> question for the development community is if there is a way that we can
>> enable people to write code more similar to what Guatam has expressed while
>> preserving the performance of our existing implementations?
>>
>> The ideas that come to mind would be to take an API abstraction approach
>> more akin to what we can see in Theano where we can express a series of
>> matrix transformations abstractly and then let the framework work out the
>> best way to calculate the pipeline?  Large project to do that... but it
>> could one answer to the long held question "how should we define our python
>> abstraction layer?".
>>
>> As a whole I'd be pretty resistant to adding dependencies on numpy/scipy
>> unless there was a compelling use case where the performance overhead of
>> implementing the MATH (instead of the control flow) in python was not
>> unacceptably large.
>>
>> -Caleb
>>
>> On Thu, Dec 24, 2015 at 12:51 PM, Frank McQuillan <fmcquil...@pivotal.io>
>> wrote:
>>
>>> Gautam,
>>>
>>> Thank you for working on this, it can be a great addition to MADlib.  Cpl
>>> comments below:
>>>
>>> 0) Dependencies on numpy and scipy.  Currently the platforms PostgreSQL,
>>> GPDB and HAWQ do not ship with numpy or scipy by default, so we may need
>>> to
>>> look at this dependency more closely.
>>>
>>> 2a,b) The following creation methods exist will exist MADlib 1.9.  They
>>> are
>>> already in the MADlib code base:
>>>
>>> -- Create a matrix initialized with ones of given row and column dimension
>>>   matrix_ones( row_dim, col_dim, matrix_out, out_args)
>>>
>>> -- Create a matrix initialized with zeros of given row and column
>>> dimension
>>>   matrix_zeros( row_dim, col_dim, matrix_out, out_args)
>>>
>>> -- Create an square identity matrix of size dim x dim
>>>   matrix_identity( dim, matrix_out, out_args)
>>>
>>> -- Create a diag matrix initialized with given diagonal elements
>>>   matrix_diag( diag_elements, matrix_out, out_args)
>>>
>>> 2c) As for “Sampling matrices and scalars from certain distributions. We
>>> could start with Gaussian (multi-variate), truncated normal, Wishart,
>>> Inverse-Wishart, Gamma, and Beta.”  I created a JIRA for that here:
>>> https://issues.apache.org/jira/browse/MADLIB-940
>>> I agree with your recommendation.
>>>
>>> 3) Pipelining
>>> * it’s an architecture question that I agree we need to address, to reduce
>>> disk I/O between steps
>>> * Could be a platform implementation, or we can think about if MADlib can
>>> do something on top of the existing platform by coming up with a way to
>>> chain operations in-memory
>>>
>>> 4) I would *strongly* encourage you to go the next/last mile and get this
>>> into MADlib.  The community can help you do it.  And as you say we need to
>>> figure out how/if to support numpy and scipy, or do MADlib functions via
>>> Eigen or Boost to handle alternatively.
>>>
>>> Frank
>>>
>>> On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar <
>>> gautam.s.muralid...@gmail.com> wrote:
>>>
>>> > > Hi Team MADlib,
>>> > >
>>> > > I managed to complete the implementation of the Bayesian analysis of
>>> the
>>> > binary Probit regression model on MPP. The code has been tested on the
>>> > greenplum sandbox VM and seems to work fine. You can find the code here:
>>> > >
>>> > >
>>> >
>>> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
>>> > >
>>> > > In the git repo, probit_regression.ipynb is the stand alone python
>>> > implementation. To verify correctness, I compared against R's MCMCpack
>>> > library that can also be run in the Jupyter notebook!
>>> > >
>>> > > probit_regression_mpp.ipynb is the distributed implementation for
>>> > Greenplum or HAWQ. This uses the MADlib matrix operations heavily. In
>>> > addition, it also has dependencies on numpy and scipy. If you look at
>>> the
>>> > Gibbs Probit Driver function, you will see that the only operations in
>>> > memory are those that involve inverting a matrix (in this case, the
>>> > covariance matrix or the X_T_X matrix, whose dimensions equal the
>>> number of
>>> > features and hence, hopefully reasonable), sampling from a multivariate
>>> > normal, and handling the coefficients.
>>> > >
>>> > > A couple of observations based on my experience with the MADlib matrix
>>> > operations:
>>> > >
>>> > > 1. First of all, they are a real boon! Last year, we implemented the
>>> > auto encoder in MPP and we had to write our own matrix operations, which
>>> > was painful. So kudos to you guys! The Matrix operations meant that it
>>> took
>>> > me ~ 4 hours to complete the implementation in MPP. That is significant,
>>> > albeit I have experience with SQL and PL/Python.
>>> > >
>>> > > 2. It would be great if we can get the following matrix functionality
>>> in
>>> > MADlib at some point:
>>> > >     a. Creating an identity matrix
>>> > >     b. Creating a zero matrix
>>> > >     c. Sampling matrices and scalars from certain distributions. We
>>> > could start with Gaussian (multi-variate), truncated normal, Wishart,
>>> > Inverse-Wishart, Gamma, and Beta.
>>> > >
>>> > > 3. I still do think that as a developer using MADlib matrix
>>> operations,
>>> > we need to write a lot of code, mainly due to the fact that we need to
>>> > create SQL tables in a pipeline. We should probably look to reduce this
>>> and
>>> > see if we can efficiently pipeline operations.
>>> > >
>>> > > 4. Lastly, I would like to see if this can end up in MADlib at some
>>> > point. But to end up in MADlib, we will need to implement the truncated
>>> > normal and multi-variate normal samplers. If we can perhaps carve out a
>>> > numpy and scipy dependent section in MADlib and make it clear that these
>>> > functions work only if numpy and scipy are installed, then that might
>>> > accelerate MADlib contributions from committers.
>>> >
>>> > Sent from my iPhone
>>>
>>
>>

Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

Reply via email to