Sorry I missed the community call this morning. I heard that this was discussed in more detail, but haven't seen the minutes of the call posted yet. Here are a couple more thoughts on this:
The matrix operation based implementation offered by Guatam is intuitive and logical way of describing the algorithm, if we had an efficient way of expressing algorithms like this it would greatly simply the process of adding new algorithms and lower the barrier to entry for contributions to the project. Which would be a good thing, so I wanted to spend a bit more thought on what this would take and why this solution is not efficient today. Primarily the existing implementation we have for calculating X_T_X in MADlib is singnificantly more efficient than the implementation within madlib.matrix_mult(), but the implementation in madlib.matrix_mult() is much more general purpose. The existing implementation is hard coded to handle the fact that both X and t(X) are operating on the same matrix and that this specific calculation is such that each row of the matrix becomes the column in the transpose that it is multiplied with meaning that if we have all the data for the row then the contributions from that row can be calculated without any additional redistribution of data. Further since they are the same table we don't have to join the two tables together to get that data and we can complete the entire operation with a single scan of one table. We do not seem to have the optimization for this very special case enabled in madlib.matrix_mult() resulting in the implementation of the multiplication being substantially slower. Similarly for X_T_Y in our typical cases X and Y are both in the same initial input table and in some ways we can think of "XY" as a single matrix that we have simply sliced vertically to produce X and Y as separate matrices, this means that despite X and Y being different matrices from the mathematical expression of the model we can still use the same in-place logic that we used for X_T_X. As expressed in the current madlib.matrix_mult() api there is no easy way for matrix_mult to recognize this relationship and so we end up forced to go the inefficient route even if we added the special case optimization when the left and right sides of the multiplication are transpositions of the same matrix. One path forward that would help make this type of implementation viable would be by adding some of these optimizations and possible api enhancements into matrix_mult code so that we can get the implementation more efficient going this route we could probably get from 30X perfomance hit down to only 2X performance hit - based on having to make separate scans for X_T_X and X_T_Y rather than being able to combine both calculations in a single scan of the data. Reducing that last 2X would take more effort and a greater level of sophistication in our optimization routines. The general case would likely require some amount of code generation. Regards, Caleb On Thu, Jan 14, 2016 at 5:32 PM, Caleb Welton <[email protected]> wrote: > Great seeing the prototype work here, I'm sure that there is something > that we can find from this work that we can bring into MADlib. > > However... It is a very different implementation from the existing > algorithms, calling into the madlib matrix functions directly rather than > having the majority of the work done within the abstraction layer. > Unfortunately this leads to a very inefficient implementation. > > As demonstration of this I ran this test case: > > Dataset: 1 dependent variable, 4 independent variables + intercept, > 10,000,00 observations > > Run using Postgres 9.4 on a Macbook Pro: > > Creating the X matrix from source table: 13.9s > Creating the Y matrix from source table: 9.1s > Computing X_T_X via matrix_mult: 169.2s > Computing X_T_Y via matrix_mult: 114.8s > > Calling madlib.linregr_train directly (implicitly calculates all of the > above as well as inverting the X_T_X matrix and calculating some other > statistics): 10.3s > > So in total about 30X slower than our existing methodology for doing the > same calculations. I would expect this delta to potentially get even > larger if it was to move from Postgres to Greenplum or HAWQ where we would > be able to start applying parallelism. (the specialized XtX multiplication > in linregr parallelizes perfectly, but the more general matrix_mult > functionality may not) > > As performance has been a key aspect to our development I'm not sure that > we want to architecturally go down the path outlined in this example code. > > That said... I can certainly see how this layer of abstraction could be a > valuable way of expressing things from a development perspective so the > question for the development community is if there is a way that we can > enable people to write code more similar to what Guatam has expressed while > preserving the performance of our existing implementations? > > The ideas that come to mind would be to take an API abstraction approach > more akin to what we can see in Theano where we can express a series of > matrix transformations abstractly and then let the framework work out the > best way to calculate the pipeline? Large project to do that... but it > could one answer to the long held question "how should we define our python > abstraction layer?". > > As a whole I'd be pretty resistant to adding dependencies on numpy/scipy > unless there was a compelling use case where the performance overhead of > implementing the MATH (instead of the control flow) in python was not > unacceptably large. > > -Caleb > > On Thu, Dec 24, 2015 at 12:51 PM, Frank McQuillan <[email protected]> > wrote: > >> Gautam, >> >> Thank you for working on this, it can be a great addition to MADlib. Cpl >> comments below: >> >> 0) Dependencies on numpy and scipy. Currently the platforms PostgreSQL, >> GPDB and HAWQ do not ship with numpy or scipy by default, so we may need >> to >> look at this dependency more closely. >> >> 2a,b) The following creation methods exist will exist MADlib 1.9. They >> are >> already in the MADlib code base: >> >> -- Create a matrix initialized with ones of given row and column dimension >> matrix_ones( row_dim, col_dim, matrix_out, out_args) >> >> -- Create a matrix initialized with zeros of given row and column >> dimension >> matrix_zeros( row_dim, col_dim, matrix_out, out_args) >> >> -- Create an square identity matrix of size dim x dim >> matrix_identity( dim, matrix_out, out_args) >> >> -- Create a diag matrix initialized with given diagonal elements >> matrix_diag( diag_elements, matrix_out, out_args) >> >> 2c) As for “Sampling matrices and scalars from certain distributions. We >> could start with Gaussian (multi-variate), truncated normal, Wishart, >> Inverse-Wishart, Gamma, and Beta.” I created a JIRA for that here: >> https://issues.apache.org/jira/browse/MADLIB-940 >> I agree with your recommendation. >> >> 3) Pipelining >> * it’s an architecture question that I agree we need to address, to reduce >> disk I/O between steps >> * Could be a platform implementation, or we can think about if MADlib can >> do something on top of the existing platform by coming up with a way to >> chain operations in-memory >> >> 4) I would *strongly* encourage you to go the next/last mile and get this >> into MADlib. The community can help you do it. And as you say we need to >> figure out how/if to support numpy and scipy, or do MADlib functions via >> Eigen or Boost to handle alternatively. >> >> Frank >> >> On Thu, Dec 24, 2015 at 12:29 PM, Gautam Muralidhar < >> [email protected]> wrote: >> >> > > Hi Team MADlib, >> > > >> > > I managed to complete the implementation of the Bayesian analysis of >> the >> > binary Probit regression model on MPP. The code has been tested on the >> > greenplum sandbox VM and seems to work fine. You can find the code here: >> > > >> > > >> > >> https://github.com/gautamsm/data-science-on-mpp/tree/master/BayesianAnalysis >> > > >> > > In the git repo, probit_regression.ipynb is the stand alone python >> > implementation. To verify correctness, I compared against R's MCMCpack >> > library that can also be run in the Jupyter notebook! >> > > >> > > probit_regression_mpp.ipynb is the distributed implementation for >> > Greenplum or HAWQ. This uses the MADlib matrix operations heavily. In >> > addition, it also has dependencies on numpy and scipy. If you look at >> the >> > Gibbs Probit Driver function, you will see that the only operations in >> > memory are those that involve inverting a matrix (in this case, the >> > covariance matrix or the X_T_X matrix, whose dimensions equal the >> number of >> > features and hence, hopefully reasonable), sampling from a multivariate >> > normal, and handling the coefficients. >> > > >> > > A couple of observations based on my experience with the MADlib matrix >> > operations: >> > > >> > > 1. First of all, they are a real boon! Last year, we implemented the >> > auto encoder in MPP and we had to write our own matrix operations, which >> > was painful. So kudos to you guys! The Matrix operations meant that it >> took >> > me ~ 4 hours to complete the implementation in MPP. That is significant, >> > albeit I have experience with SQL and PL/Python. >> > > >> > > 2. It would be great if we can get the following matrix functionality >> in >> > MADlib at some point: >> > > a. Creating an identity matrix >> > > b. Creating a zero matrix >> > > c. Sampling matrices and scalars from certain distributions. We >> > could start with Gaussian (multi-variate), truncated normal, Wishart, >> > Inverse-Wishart, Gamma, and Beta. >> > > >> > > 3. I still do think that as a developer using MADlib matrix >> operations, >> > we need to write a lot of code, mainly due to the fact that we need to >> > create SQL tables in a pipeline. We should probably look to reduce this >> and >> > see if we can efficiently pipeline operations. >> > > >> > > 4. Lastly, I would like to see if this can end up in MADlib at some >> > point. But to end up in MADlib, we will need to implement the truncated >> > normal and multi-variate normal samplers. If we can perhaps carve out a >> > numpy and scipy dependent section in MADlib and make it clear that these >> > functions work only if numpy and scipy are installed, then that might >> > accelerate MADlib contributions from committers. >> > >> > Sent from my iPhone >> > >
