Thank you for the further analysis James. This certainly looks like something to change, in MADlib-2.0. Since 2.0 is a backward compatibility breaking release, it's just easier to have the same function name and just change its implementation. But more importantly, we will have ample time to do some serious performance testing, since array_dot is used internally by other MADlib modules such as SVM and SVD to name a few.
Given the performance impact this particular change could have, I vote to make the change. Since it would also incur lots of performance testing, I would vote to include this in the 2.0 release. NJ On Fri, Oct 20, 2017 at 1:59 PM, James Gregory <james....@gmail.com> wrote: > On 20 October 2017 at 18:43, Nandish Jayaram <njaya...@pivotal.io> wrote: > > Thank you for following up on that JIRA James. Based on some more code > > exploration, it looks like we should be able to replace the native > > implementation > > of array_dot() with Eigen's dot() function. array_dot() currently takes > in > > `anyarray` > > as you pointed out, and cosine_similarity() takes in double precision > > arrays. > > > > - But I was able to run cosine_similarity() on int[], float8[] and double > > precision[] > > vector pairs without any issues. > > - I also checked that the current array_dot() returns a float8, and not > > the type of the input arrays, while cosine_similarity() returns a double. > > - Internally in MADlib, a few modules (GLM, SVM, SVD, matrix_ops, and > > conjugate > > gradient) use the array_dot() function, and they too should not be > affected > > by > > this change. > > > > So it looks like there might not be any backward compatibility breaking > > changes if we replace the native array_dot() with Eigen's dot(). > > > > Testing locally, eigen array_dot is much faster for doubles, but > normal array_dot is a bit faster for float4. I don't have enough > knowledge of the internals of either postgres or madlib to say exactly > why this is. Maybe postgres is casting float4[] to float8[] when > calling postgres functions defined as taking doubles, or maybe > postgres itself doesn't cast but rather then internals of array_ops.c > are written in such a way as to be faster for float4 than the > internals of Eigen. > > But it seems that even if switching out the implementation totally > isn't actually a breaking change, it would cause a slight performance > degradation for people not using double precision. >