On Fri, 2012-11-09 at 14:52 -0800, Nicolas SCHEFFER wrote:
> Ok: comparing apples to apples. I'm clueless on my observations and
> would need input from you guys.
> 
> Using ATLAS 3.10, numpy with and without my changes, I'm getting these
> timings and comparisons.
> 
> #
> #I. Generate matrices using regular dot:
> #
> import numpy as np
> big = np.array(np.random.randn(2000, 2000), 'f');
> np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T),
> left=big.T.dot(big), right=big.dot(big.T))
> 
> #
> #II. Timings with regular dot
> #
> In [3]: %timeit np.dot(big, big)
> 10 loops, best of 3: 138 ms per loop
> 
> In [4]: %timeit np.dot(big, big.T)
> 10 loops, best of 3: 166 ms per loop
> 
> In [5]: %timeit np.dot(big.T, big.T)
> 10 loops, best of 3: 193 ms per loop
> 
> In [6]: %timeit np.dot(big.T, big)
> 10 loops, best of 3: 165 ms per loop
> 
> #
> #III. I load these arrays and time again with the "fast" dot
> #
> In [21]: %timeit np.dot(big, big)
> 10 loops, best of 3: 138 ms per loop
> 
> In [22]: %timeit np.dot(big.T, big)
> 10 loops, best of 3: 104 ms per loop
> 
> In [23]: %timeit np.dot(big.T, big.T)
> 10 loops, best of 3: 138 ms per loop
> 
> In [24]: %timeit np.dot(big, big.T)
> 10 loops, best of 3: 102 ms per loop
> 
> 1. A'.A': great!
> 2. A.A' becomes faster than A.A !?!
> 
> #
> #IV. MSE on differences
> #
> In [25]: np.sqrt(((arr['none'] - none)**2).sum())
> Out[25]: 0.0
> 
> In [26]: np.sqrt(((arr['both'] - both)**2).sum())
> Out[26]: 0.0
> 
> In [27]: np.sqrt(((arr['left'] - left)**2).sum())
> Out[27]: 0.015900515
> 
> In [28]: np.sqrt(((arr['right'] - right)**2).sum())
> Out[28]: 0.015331409
> 
> #
> # Conclusion
> #
> While these errors are small, I'm wondering whether:
> - It's a bug: the results should be exactly the same
> - It's a feature: BLAS takes a shortcut when one operand is a transpose
> (A.A'). The difference is not significant. Quick: submit that PR asap!
> 

I think it's a feature: memory can be accessed in a more regular fashion
when one array is Fortran-contiguous and the other C-contiguous. I.e., for
A.B', fetching the first row of A is ideal, since its elements sit one
after another in memory (C order), and it has to be multiplied with the
first column of B', which, B' being a transpose, is in Fortran order, so
that column is also one contiguous chunk of memory.
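
To make that concrete, here is roughly what the python-level workaround
being discussed looks like (just a sketch, not the actual patch: the name
fast_dot is made up, and it assumes scipy's get_blas_funcs is available).
When an operand is the transpose of a C-contiguous array, it is already one
contiguous chunk in Fortran order, so it can be handed to gemm with a trans
flag instead of being copied first:

import numpy as np
from scipy.linalg import get_blas_funcs

def _as_blas_operand(x):
    # Return (array, trans) such that op(array) == x; when x is C- or
    # F-contiguous, no copy is needed to pass it to Fortran-ordered BLAS.
    if x.flags.f_contiguous:
        return x, False
    return x.T, True   # x.T is Fortran-contiguous when x is C-contiguous

def fast_dot(a, b):
    # Illustrative 2-D float path only; anything else falls back to np.dot.
    if (a.ndim != 2 or b.ndim != 2 or a.dtype != b.dtype
            or a.dtype not in (np.float32, np.float64)):
        return np.dot(a, b)
    gemm, = get_blas_funcs(('gemm',), (a, b))
    a_, trans_a = _as_blas_operand(a)
    b_, trans_b = _as_blas_operand(b)
    # gemm computes op(a_) . op(b_); note the result comes back F-ordered.
    return gemm(1.0, a_, b_, trans_a=trans_a, trans_b=trans_b)

With something like this, fast_dot(big.T, big) and fast_dot(big, big.T)
avoid the extra copy that plain np.dot makes for the transposed argument,
which is where the speedup in section III comes from.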

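As for the non-zero differences in section IV: bit-for-bit equality between
the two code paths is not something to expect in float32, because the BLAS
kernels accumulate the partial sums in a different order depending on which
operand is transposed. A relative comparison is the more meaningful check
(a sketch, reusing the arrays loaded in the session above):

# The two code paths should agree to within single-precision rounding.
np.allclose(arr['left'], left, rtol=1e-4, atol=1e-5)    # expect True
np.allclose(arr['right'], right, rtol=1e-4, atol=1e-5)  # expect True

If those return True, the differences are just rounding noise rather than a
bug in the patched dot.
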
> I don't have enough expertise to answer that...
> 
> Thanks much!
> 
> -nicolas
> On Fri, Nov 9, 2012 at 2:13 PM, Nicolas SCHEFFER
> <scheffer.nico...@gmail.com> wrote:
> > I too encourage users to use scipy.linalg for speed and robustness
> > (hence calling this scipy.dot), but it just brings so much confusion!
> > When using the scipy + numpy ecosystem, you'd almost want everything
> > to be done with scipy so that you get the best implementation in all
> > cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
> >
> > Anyway, this is indeed for another thread; the confusion we'd like to
> > fix here is that users shouldn't have to understand the C/F-contiguity
> > concepts to get the maximum speed out of np.dot().
> >
> > To summarize:
> > - The python snippet I posted is still valid and can speed up your
> > code if you can change all your dot() calls.
> > - The change in dotblas.c is a bit more problematic because it's very
> > core. I'm having trouble replicating the timings right now: I'm getting
> > better timings for a.dot(a.T) than for a.dot(a). There might be a bug.
> >
> > It's a pain to test because I cannot do the test in a single python session.
> > I'm going to try to integrate most of your suggestions, though I cannot
> > guarantee I'll have time to do them all.
> >
> > -nicolas
> > On Fri, Nov 9, 2012 at 8:56 AM, Nathaniel Smith <n...@pobox.com> wrote:
> >> On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux
> >> <gael.varoqu...@normalesup.org> wrote:
> >>> On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote:
> >>>> But what if someone compiles numpy against an optimized blas (mkl,
> >>>> say) and then compiles SciPy against the reference blas? What do you
> >>>> do then!? ;-)
> >>>
> >>> This could happen. But the converse happens very often. What happens is
> >>> that users (e.g. on a shared computing resource) ask for a scientific
> >>> python environment. The administrator then installs the packages starting
> >>> from the most basic one and ending with the most advanced one, thus
> >>> starting with numpy, which can very well build without any external blas.
> >>> When he gets to scipy, he hits the problem that the build system does not
> >>> detect the blas properly, and he solves that problem.
> >>>
> >>> Also, it used to be that on the major linux distributions, numpy would
> >>> not be built with an optimized lapack, because numpy was in the 'base'
> >>> set of packages but lapack was not. Scipy, on the contrary, being in the
> >>> 'contrib' set, could depend on lapack. I just checked, and this has been
> >>> fixed in the major distributions (Fedora, Debian, Ubuntu).
> >>>
> >>> Now we can discuss whether such problems should happen, and put the blame
> >>> on the users/administrators, but the fact is that they happen often. I
> >>> keep seeing environments in which np.linalg is unreasonably slow.
> >>
> >> If this is something that's been a problem for you, maybe we should
> >> start another thread on things we could do to fix it directly? Improve
> >> build instructions, advertise build systems that set up the whole
> >> environment (and thus do the right thing), make numpy's setup.py
> >> scream and yell if blas isn't available...?
> >>
> >> -n
> 


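On the slow-np.linalg environments Gael mentions: a quick way to see which
blas/lapack numpy was actually built against is something like the
following (a sketch; the exact output format varies between numpy builds):

import numpy as np
np.show_config()   # prints the blas_opt_info / lapack_opt_info sections

# A crude runtime check as well: with an optimized BLAS, a 2000x2000
# single-precision product takes on the order of 100 ms (see the timings
# above), while an un-optimized build is many times slower.
import timeit
a = np.array(np.random.randn(2000, 2000), 'f')
print(timeit.timeit(lambda: np.dot(a, a), number=3) / 3)
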
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
