On Fri, 2012-11-09 at 14:52 -0800, Nicolas SCHEFFER wrote:
> Ok: comparing apples to apples. I'm clueless about my observations and
> would need input from you guys.
>
> Using ATLAS 3.10 and numpy with and without my changes, I'm getting
> the timings and comparisons below.
>
> #
> # I. Generate matrices using regular dot:
> #
> big = np.array(np.random.randn(2000, 2000), 'f')
> np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T),
>          left=big.T.dot(big), right=big.dot(big.T))
>
> #
> # II. Timings with regular dot
> #
> In [3]: %timeit np.dot(big, big)
> 10 loops, best of 3: 138 ms per loop
>
> In [4]: %timeit np.dot(big, big.T)
> 10 loops, best of 3: 166 ms per loop
>
> In [5]: %timeit np.dot(big.T, big.T)
> 10 loops, best of 3: 193 ms per loop
>
> In [6]: %timeit np.dot(big.T, big)
> 10 loops, best of 3: 165 ms per loop
>
> #
> # III. I load these arrays and time again with the "fast" dot
> #
> In [21]: %timeit np.dot(big, big)
> 10 loops, best of 3: 138 ms per loop
>
> In [22]: %timeit np.dot(big.T, big)
> 10 loops, best of 3: 104 ms per loop
>
> In [23]: %timeit np.dot(big.T, big.T)
> 10 loops, best of 3: 138 ms per loop
>
> In [24]: %timeit np.dot(big, big.T)
> 10 loops, best of 3: 102 ms per loop
>
> 1. A'.A': great!
> 2. A.A' becomes faster than A.A!?!
>
> #
> # IV. Norm of the differences
> #
> In [25]: np.sqrt(((arr['none'] - none)**2).sum())
> Out[25]: 0.0
>
> In [26]: np.sqrt(((arr['both'] - both)**2).sum())
> Out[26]: 0.0
>
> In [27]: np.sqrt(((arr['left'] - left)**2).sum())
> Out[27]: 0.015900515
>
> In [28]: np.sqrt(((arr['right'] - right)**2).sum())
> Out[28]: 0.015331409
>
> #
> # Conclusion
> #
> While the differences are small, I'm wondering whether:
> - it's a bug: the results should be exactly the same;
> - it's a feature: BLAS is taking shortcuts when you have A.A', the
>   difference is not significant. Quick: PR that asap!
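The layouts behind these timings are easy to inspect. Below is a minimal
sketch, assuming a numpy whose np.dot hands transposed operands to BLAS
without copying; since the two code paths accumulate sums in different
orders, bitwise equality in float32 is not to be expected, which would
make the nonzero values in IV a feature rather than a bug:

import numpy as np

big = np.asarray(np.random.randn(2000, 2000), dtype=np.float32)

# .T is a zero-copy view: same buffer, swapped strides, so a
# C-contiguous array becomes F-contiguous (and vice versa).
print(big.flags.c_contiguous, big.flags.f_contiguous)      # True False
print(big.T.flags.c_contiguous, big.T.flags.f_contiguous)  # False True

# The same product via two BLAS paths: the F-contiguous view handed
# over as-is vs. a forced C-ordered copy. Different summation orders
# mean the float32 results need only agree to within rounding error.
fast = np.dot(big.T, big)
slow = np.dot(np.ascontiguousarray(big.T), big)
print(np.abs(fast - slow).max())           # small, possibly nonzero
print(np.allclose(fast, slow, atol=1e-3))  # True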
I think it's a feature, because memory can be accessed in a more regular
fashion when one array is Fortran-contiguous and the other C-contiguous.
I.e., if it's A.B', then fetching the first row of A is perfect, since
it all sits behind each other in memory (C order), while it has to be
multiplied with the first column of B', which, due to being transposed,
is Fortran order, and that column is thus also one contiguous chunk of
memory.
> I don't have enough expertise to answer that...
>
> Thanks much!
>
> -nicolas
> On Fri, Nov 9, 2012 at 2:13 PM, Nicolas SCHEFFER
> <scheffer.nico...@gmail.com> wrote:
> > I too encourage users to use scipy.linalg for speed and robustness
> > (hence calling this scipy.dot), but it just brings so much confusion!
> > When using the scipy + numpy ecosystem, you'd almost want everything
> > to be done with scipy so that you get the best implementation in all
> > cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
> >
> > Anyway, this is indeed for another thread; the confusion we'd like to
> > fix here is that users shouldn't have to understand the C/F-contiguous
> > concepts to get the maximum speed out of np.dot().
> >
> > To summarize:
> > - The python snippet I posted is still valid and can speed up your
> >   code if you can change all your dot() calls.
> > - The change in dotblas.c is a bit more problematic because it's very
> >   core. I'm having trouble right now replicating the timings: I'm
> >   getting better timings for a.dot(a.T) than for a.dot(a). There
> >   might be a bug.
> >
> > It's a pain to test because I cannot do the test in a single python
> > session. I'm going to try to integrate most of your suggestions; I
> > cannot guarantee I'll have time to do them all, though.
> >
> > -nicolas
> > On Fri, Nov 9, 2012 at 8:56 AM, Nathaniel Smith <n...@pobox.com> wrote:
> >> On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux
> >> <gael.varoqu...@normalesup.org> wrote:
> >>> On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote:
> >>>> But what if someone compiles numpy against an optimized blas (mkl,
> >>>> say) and then compiles SciPy against the reference blas? What do
> >>>> you do then!? ;-)
> >>>
> >>> This could happen. But the converse happens very often. What happens
> >>> is that users (e.g. on a shared computing resource) ask for a
> >>> scientific python environment. The administrator then installs the
> >>> packages starting from the most basic one and moving to the most
> >>> advanced one, thus starting with numpy, which can very well build
> >>> without any external blas. When he gets to scipy he hits the problem
> >>> that the build system does not detect the blas properly, and he
> >>> solves that problem.
> >>>
> >>> Also, it used to be that on the major linux distributions numpy
> >>> would not be built with an optimized lapack, because numpy was in
> >>> the 'base' set of packages but lapack was not. Scipy, on the
> >>> contrary, being in the 'contrib' set, could depend on lapack. I just
> >>> checked, and this has been fixed in the major distributions (Fedora,
> >>> Debian, Ubuntu).
> >>>
> >>> Now we can discuss whether such problems should happen and put the
> >>> blame on the users/administrators; the fact is that they happen
> >>> often. I keep seeing environments in which np.linalg is unreasonably
> >>> slow.
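A minimal way to spot such an environment, assuming only a working
numpy, is to print the build configuration and time a single large
product. The ~100 ms expectation below is only a rough assumption about
an optimized BLAS (ATLAS, MKL) on 2012-era hardware:

import time
import numpy as np

# Show which BLAS/LAPACK numpy was linked against; a missing or
# reference-only blas section is the usual culprit behind a slow
# np.linalg.
np.show_config()

# Crude smoke test: an optimized BLAS multiplies two 2000x2000 float32
# matrices in roughly the ~100 ms range; the reference implementation
# is several times slower.
big = np.ones((2000, 2000), dtype=np.float32)
start = time.time()
np.dot(big, big)
print('np.dot took %.3f s' % (time.time() - start))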
> >> If this is something that's been a problem for you, maybe we should
> >> start another thread on things we could do to fix it directly?
> >> Improve build instructions, advertise build systems that set up the
> >> whole environment (and thus do the right thing), make numpy's
> >> setup.py scream and yell if blas isn't available...?
> >>
> >> -n
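For completeness, the wrapper-level approach discussed above (set GEMM's
trans flags from Python instead of letting dot() copy a transposed
operand) can be sketched as below. This is a hypothetical
reconstruction, not Nicolas's actual snippet; it uses
scipy.linalg.get_blas_funcs and covers only the 2-D float32/float64
case:

import numpy as np
from scipy.linalg import get_blas_funcs

def fast_dot(a, b):
    # Fall back to plain dot for anything but the simple 2-D real case.
    if (a.ndim != 2 or b.ndim != 2 or a.dtype != b.dtype
            or a.dtype not in (np.float32, np.float64)):
        return np.dot(a, b)

    gemm, = get_blas_funcs(('gemm',), (a, b))

    # GEMM wants Fortran-ordered operands. A C-contiguous array is the
    # Fortran-order view of its own transpose, so pass X.T with trans=1
    # instead of copying X.
    trans_a, aa = (0, a) if a.flags.f_contiguous else (1, a.T)
    trans_b, bb = (0, b) if b.flags.f_contiguous else (1, b.T)

    # Note: gemm returns a Fortran-ordered result, unlike np.dot's
    # C-ordered one; the values are identical up to rounding.
    return gemm(1.0, aa, bb, trans_a=trans_a, trans_b=trans_b)

big = np.array(np.random.randn(2000, 2000), 'f')
print(np.allclose(fast_dot(big.T, big), np.dot(big.T, big), atol=1e-3))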