Following up on the suggestions made in this thread, I re-implemented the MCMC Probit and Logit regressions using the MADlib crossprod function. The resulting implementations are orders of magnitude faster than the first implementation, which used the matrix_mult function. The new implementations can be found here (look for the cell containing madlib.crossprod):

https://github.com/gautamsm/data-science-on-mpp/blob/master/BayesianAnalysis/probit_regression_mpp.ipynb
https://github.com/gautamsm/data-science-on-mpp/blob/master/BayesianAnalysis/logit_regression_mpp.ipynb

The new implementation gives us a couple of MCMC-based algorithms that we can actually use.

Best,
Gautam
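A minimal sketch of the crossprod-based computation (table and column names are illustrative, not the notebooks' actual identifiers; see Rahul's explanation of the aggregate further down the thread):

    -- One scan computes both X'X and X'Y via the crossprod aggregate.
    -- Assumes src(row_vec float8[], y float8), where row_vec holds one
    -- design-matrix row, including the intercept term.
    SELECT madlib.crossprod(row_vec, row_vec)  AS xtx_flat,  -- X'X, flattened into a k*k array
           madlib.crossprod(row_vec, ARRAY[y]) AS xty_flat   -- X'Y, with y treated as a 1x1 vector
    FROM src;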
On Fri, Jan 15, 2016 at 4:27 PM, Gautam Muralidhar <[email protected]> wrote:

> Thanks for the comments, Rahul and Caleb.
>
> @Rahul - I will look to implement your suggestions.
>
> Best,
> Gautam
>
> On Fri, Jan 15, 2016 at 3:27 PM, Rahul Iyer <[email protected]> wrote:
>
>> Thanks for your comments, Caleb.
>>
>> @Gautam: as I mentioned in the community call today, we have an aggregate function, crossprod(float8[], float8[]), that can be used to perform the X'X and X'Y operations:
>> - for X'X, the row_vec column would be both vector inputs;
>> - for X'Y, the row_vec column of X would be the first input and the Y value as an array would be the second input (crossprod needs to treat the Y as a 1x1 vector).
>> You would, however, have to be careful with the X'X output - it's the matrix flattened into an array, so you would have to reshape it.
>>
>> As Caleb said, we would benefit from inspecting the distribution of the two input matrices in matrix_mult and switching between the currently implemented inner product and this crossprod aggregate (outer product).
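Rahul's reshape caveat, sketched in plain SQL (this assumes the flattened result has been stored in a table xtx(xtx_flat float8[]) and that k = 5; the row-major layout is an assumption, not something stated above):

    -- Reshape the flattened 5x5 X'X into one output row per matrix row.
    -- Assumes row-major flattening and 1-based array indexing.
    SELECT i AS row_id,
           array_agg(xtx_flat[(i - 1) * 5 + j] ORDER BY j) AS row_vec
    FROM xtx,
         generate_series(1, 5) AS i,
         generate_series(1, 5) AS j
    GROUP BY i
    ORDER BY i;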
>> On Fri, Jan 15, 2016 at 2:52 PM, Caleb Welton <[email protected]> wrote:
>>
>> > Sorry I missed the community call this morning. I heard that this was discussed in more detail, but I haven't seen the minutes of the call posted yet. Here are a couple more thoughts on this:
>> >
>> > The matrix-operation-based implementation offered by Gautam is an intuitive and logical way of describing the algorithm. If we had an efficient way of expressing algorithms like this, it would greatly simplify the process of adding new algorithms and lower the barrier to entry for contributions to the project, which would be a good thing. So I wanted to spend a bit more thought on what this would take and why this solution is not efficient today.
>> >
>> > Primarily, the existing implementation we have for calculating X_T_X in MADlib is significantly more efficient than the implementation within madlib.matrix_mult(), though the implementation in madlib.matrix_mult() is much more general purpose. The existing implementation is hard-coded around the fact that X and t(X) are the same matrix, and that in this specific calculation each row of the matrix is multiplied with the column it becomes in the transpose - in other words, X_T_X is just the sum of the outer products of the individual rows. So if we have all the data for a row, that row's contribution can be calculated without any additional redistribution of data. Further, since they are the same table, we don't have to join two tables together to get that data, and we can complete the entire operation with a single scan of one table. We do not seem to have the optimization for this very special case enabled in madlib.matrix_mult(), which makes that implementation of the multiplication substantially slower.
>> >
>> > Similarly for X_T_Y: in our typical cases, X and Y are both in the same initial input table, and in some ways we can think of "XY" as a single matrix that we have simply sliced vertically to produce X and Y as separate matrices. This means that even though X and Y are different matrices in the mathematical expression of the model, we can still use the same in-place logic that we used for X_T_X. As expressed in the current madlib.matrix_mult() API, there is no easy way for matrix_mult to recognize this relationship, so we would end up forced down the inefficient route even if we added the special-case optimization for when the left and right sides of the multiplication are transpositions of the same matrix.
>> >
>> > One path forward that would help make this type of implementation viable would be to add some of these optimizations, and possibly API enhancements, to the matrix_mult code. Going this route, we could probably get from a 30x performance hit down to only a 2x performance hit - the remaining 2x coming from having to make separate scans for X_T_X and X_T_Y rather than combining both calculations in a single scan of the data. Reducing that last 2x would take more effort and a greater level of sophistication in our optimization routines; the general case would likely require some amount of code generation.
>> >
>> > Regards,
>> > Caleb
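To make the single-scan point concrete, here is an illustrative hand-rolled version for a two-column X (table and column names assumed): every entry of X'X and X'Y is a plain sum over rows, which is why one pass over one table suffices and why the computation parallelizes so well.

    -- Single scan of one table yields all entries of X'X and X'Y.
    -- k = 2 shown; X'X is symmetric, so only the upper triangle is computed.
    SELECT sum(x1 * x1) AS xtx_11, sum(x1 * x2) AS xtx_12,
           sum(x2 * x2) AS xtx_22,
           sum(x1 * y)  AS xty_1,  sum(x2 * y)  AS xty_2
    FROM source_table;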
>> > On Thu, Jan 14, 2016 at 5:32 PM, Caleb Welton <[email protected]> wrote:
>> >
>> >> Great seeing the prototype work here - I'm sure there is something we can find in this work to bring into MADlib.
>> >>
>> >> However... it is a very different implementation from the existing algorithms, calling into the MADlib matrix functions directly rather than having the majority of the work done within the abstraction layer. Unfortunately, this leads to a very inefficient implementation.
>> >>
>> >> As a demonstration, I ran this test case:
>> >>
>> >> Dataset: 1 dependent variable, 4 independent variables + intercept, 10,000,000 observations
>> >>
>> >> Run using Postgres 9.4 on a MacBook Pro:
>> >>
>> >> Creating the X matrix from the source table: 13.9s
>> >> Creating the Y matrix from the source table: 9.1s
>> >> Computing X_T_X via matrix_mult: 169.2s
>> >> Computing X_T_Y via matrix_mult: 114.8s
>> >>
>> >> Calling madlib.linregr_train directly (which implicitly calculates all of the above, as well as inverting the X_T_X matrix and computing some other statistics): 10.3s
>> >>
>> >> So in total, about 30x slower than our existing methodology for the same calculations. I would expect this delta to get even larger moving from Postgres to Greenplum or HAWQ, where we would be able to start applying parallelism (the specialized X_T_X multiplication in linregr parallelizes perfectly, but the more general matrix_mult functionality may not).
>> >>
>> >> As performance has been a key aspect of our development, I'm not sure we want to architecturally go down the path outlined in this example code.
>> >>
>> >> That said... I can certainly see how this layer of abstraction could be a valuable way of expressing things from a development perspective. So the question for the development community is: is there a way we can enable people to write code more similar to what Gautam has expressed, while preserving the performance of our existing implementations?
>> >>
>> >> The idea that comes to mind would be to take an API-abstraction approach more akin to what we see in Theano, where we can express a series of matrix transformations abstractly and then let the framework work out the best way to calculate the pipeline. That would be a large project... but it could be one answer to the long-held question "how should we define our python abstraction layer?".
>> >>
>> >> As a whole, I'd be pretty resistant to adding dependencies on numpy/scipy unless there was a compelling use case where the performance overhead of implementing the MATH (instead of the control flow) in Python was not unacceptably large.
>> >>
>> >> -Caleb
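For reference, the direct call Caleb benchmarked looks roughly like this (source and output table names are illustrative):

    -- One call computes X'X and X'Y in a single scan, inverts X'X, and
    -- produces the coefficients and statistics.
    SELECT madlib.linregr_train(
        'source_table',              -- input table
        'linregr_out',               -- output model table
        'y',                         -- dependent variable
        'ARRAY[1, x1, x2, x3, x4]'   -- independent variables, with intercept
    );

The materialize-X, materialize-Y, then matrix_mult pipeline performs the same algebra across multiple scans and intermediate tables, which is the ~30x gap Caleb measured.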
>> >> On Thu, Dec 24, 2015 at 12:51 PM, Frank McQuillan <[email protected]> wrote:
>> >>
>> >>> Gautam,
>> >>>
>> >>> Thank you for working on this; it can be a great addition to MADlib. A couple of comments below:
>> >>>
>> >>> 0) Dependencies on numpy and scipy. Currently the platforms PostgreSQL, GPDB, and HAWQ do not ship with numpy or scipy by default, so we may need to look at this dependency more closely.
>> >>>
>> >>> 2a,b) The following creation methods will exist in MADlib 1.9. They are already in the MADlib code base:
>> >>>
>> >>> -- Create a matrix initialized with ones of given row and column dimension
>> >>> matrix_ones( row_dim, col_dim, matrix_out, out_args)
>> >>>
>> >>> -- Create a matrix initialized with zeros of given row and column dimension
>> >>> matrix_zeros( row_dim, col_dim, matrix_out, out_args)
>> >>>
>> >>> -- Create a square identity matrix of size dim x dim
>> >>> matrix_identity( dim, matrix_out, out_args)
>> >>>
>> >>> -- Create a diagonal matrix initialized with the given diagonal elements
>> >>> matrix_diag( diag_elements, matrix_out, out_args)
>> >>>
>> >>> 2c) As for “Sampling matrices and scalars from certain distributions. We could start with Gaussian (multi-variate), truncated normal, Wishart, Inverse-Wishart, Gamma, and Beta,” I created a JIRA for that here:
>> >>> https://issues.apache.org/jira/browse/MADLIB-940
>> >>> I agree with your recommendation.
>> >>>
>> >>> 3) Pipelining
>> >>> * It’s an architecture question that I agree we need to address, to reduce disk I/O between steps.
>> >>> * It could be a platform implementation, or we can think about whether MADlib can do something on top of the existing platform by coming up with a way to chain operations in memory.
>> >>>
>> >>> 4) I would *strongly* encourage you to go the next/last mile and get this into MADlib. The community can help you do it. And, as you say, we need to figure out how/if to support numpy and scipy, or alternatively implement the MADlib functions via Eigen or Boost.
>> >>>
>> >>> Frank
>> >>>
>> >>> On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar <[email protected]> wrote:
>> >>>
>> >>> > Hi Team MADlib,
>> >>> >
>> >>> > I managed to complete the implementation of the Bayesian analysis of the binary Probit regression model on MPP. The code has been tested on the Greenplum sandbox VM and seems to work fine. You can find the code here:
>> >>> >
>> >>> > https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis
>> >>> >
>> >>> > In the git repo, probit_regression.ipynb is the stand-alone Python implementation. To verify correctness, I compared against R's MCMCpack library, which can also be run in the Jupyter notebook!
>> >>> >
>> >>> > probit_regression_mpp.ipynb is the distributed implementation for Greenplum or HAWQ. This uses the MADlib matrix operations heavily. In addition, it also has dependencies on numpy and scipy. If you look at the Gibbs Probit Driver function, you will see that the only in-memory operations are those that involve inverting a matrix (in this case, the covariance matrix or the X_T_X matrix, whose dimensions equal the number of features and are hence, hopefully, reasonable), sampling from a multivariate normal, and handling the coefficients.
>> >>> >
>> >>> > A couple of observations based on my experience with the MADlib matrix operations:
>> >>> >
>> >>> > 1. First of all, they are a real boon! Last year, we implemented the autoencoder on MPP and we had to write our own matrix operations, which was painful. So kudos to you guys! The matrix operations meant that it took me ~4 hours to complete the implementation on MPP. That is significant, albeit I have experience with SQL and PL/Python.
>> >>> >
>> >>> > 2. It would be great if we could get the following matrix functionality into MADlib at some point:
>> >>> >    a. Creating an identity matrix
>> >>> >    b. Creating a zero matrix
>> >>> >    c. Sampling matrices and scalars from certain distributions. We could start with Gaussian (multi-variate), truncated normal, Wishart, Inverse-Wishart, Gamma, and Beta.
>> >>> >
>> >>> > 3. I still think that, as a developer using the MADlib matrix operations, we need to write a lot of code, mainly because we need to create SQL tables in a pipeline. We should probably look to reduce this and see if we can efficiently pipeline operations.
>> >>> >
>> >>> > 4. Lastly, I would like to see this end up in MADlib at some point. But to end up in MADlib, we will need to implement the truncated normal and multivariate normal samplers. If we can perhaps carve out a numpy- and scipy-dependent section in MADlib and make it clear that these functions work only if numpy and scipy are installed, then that might accelerate MADlib contributions from committers.
>> >>> >
>> >>> > Sent from my iPhone
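On point 4, a sketch of how such a numpy/scipy-dependent section could look (hypothetical function name and signature; assumes the plpythonu language and scipy are installed on the server, which, per point 0 above, is not true by default):

    -- Hypothetical scipy-backed truncated-normal sampler, of the kind the
    -- probit Gibbs step needs to draw the latent variable per observation.
    CREATE OR REPLACE FUNCTION truncnorm_sample(
        lo float8, hi float8, mu float8, sigma float8
    ) RETURNS float8 AS $$
        from scipy.stats import truncnorm
        # truncnorm takes its bounds in standard-deviation units around loc
        a = (lo - mu) / sigma
        b = (hi - mu) / sigma
        return float(truncnorm.rvs(a, b, loc=mu, scale=sigma))
    $$ LANGUAGE plpythonu;

    -- e.g. a draw constrained to be positive, as for a y = 1 probit observation:
    -- SELECT truncnorm_sample(0, 'infinity'::float8, 0.3, 1.0);

The multivariate normal draw for the coefficients could similarly wrap numpy.random.multivariate_normal.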
--
==========================================================
Gautam Muralidhar
PhD, The University of Texas at Austin, USA.
==========================================================
