Hello,
First, a quick summary of my problem; at the end I include the basic
changes I am suggesting to the source (they may benefit others).
I am ages behind the times and I am still using Numeric in Python 2.2.3.
The main reason why it has taken so long to upgrade is because NumPy
kills performance on several of my tests.
I am sorry if this topic has been discussed before. I tried searching the
mailing list and also Google, and all I found were comments to the
effect that such is life when you use NumPy for small arrays.
In my case I have several thousands of lines of code where data
structures rely heavily on Numeric arrays but it is unpredictable if the
problem at hand will result in large or small arrays. Furthermore, once
the vectorized operations complete, the values may be assigned to
scalars that are then used in simple math or loops. I am fairly sure the
core of my problems is that 'float64' objects start propagating all over
the program's data structures (not in arrays), and they are considerably
slower for just about everything when compared to the native Python
float. In conclusion, it is not practical for me to do a massive
restructuring of code to improve speed on simple things like "a[0] < 4"
(assuming "a" is an array), which is about 10 times slower than "b < 4"
(assuming "b" is a float).
I finally decided to track down the problem, and I started by getting
Python 2.6 from source and profiling one of my cases. By far the
biggest bottleneck turned out to be PyString_FromFormatV, the function
that assembles the message string for the Python error raised when
"multiarray" calls PyObject_GetAttrString and the attribute is not
found. This function seems to get called way too often from NumPy. The
real cost of looking up an attribute that does not exist is not the
failed lookup itself, but building the string to set a Python error. In
other words, something as simple as "a[0] < 3.5" internally results in a
call to set a Python error.
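
A rough pure-Python analogue of this effect can be sketched as follows.
It is only an illustration, not the NumPy C code path: the classes
Plain and WithPriority and the helpers lookup and time_lookup are
hypothetical names, and the code assumes Python 3. It compares an
attribute lookup that succeeds with one that fails and has to raise an
AttributeError.

```python
import timeit


class Plain(object):
    """Hypothetical stand-in for an object without __array_priority__."""


class WithPriority(object):
    """Hypothetical stand-in for an object that defines the attribute."""
    __array_priority__ = 1.0


def lookup(obj):
    # On a miss, Python creates and raises an AttributeError, which is
    # caught here and turned into None.
    try:
        return obj.__array_priority__
    except AttributeError:
        return None


def time_lookup(obj, number=200000):
    # Total seconds spent on `number` lookups against obj.
    return timeit.timeit(lambda: lookup(obj), number=number)


if __name__ == "__main__":
    print("attribute present:", time_lookup(WithPriority()))
    print("attribute missing:", time_lookup(Plain()))
```

The absolute numbers depend on the interpreter and machine, so none are
quoted here.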
I downloaded the NumPy code (for Python 2.6) and tracked down all the
calls like this,

    ret = PyObject_GetAttrString(obj, "__array_priority__");

and changed them to

    if (PyList_CheckExact(obj) || (Py_None == obj) ||
        PyTuple_CheckExact(obj) ||
        PyFloat_CheckExact(obj) ||
        PyInt_CheckExact(obj) ||
        PyString_CheckExact(obj) ||
        PyUnicode_CheckExact(obj)) {
        /* Avoid expensive calls when I am sure the attribute
           does not exist */
        ret = NULL;
    }
    else {
        ret = PyObject_GetAttrString(obj, "__array_priority__");
    }

(I think I found about 7 spots.)
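
The same fast-path idea can be sketched in Python, purely to show the
logic. The names get_array_priority and _FAST_PATH_TYPES and the default
of 0.0 are assumptions made for this example, not part of NumPy, and the
mapping of PyInt/PyString to Python 3's int, str and bytes is also an
assumption.

```python
# Exact builtin types assumed never to define __array_priority__.
_FAST_PATH_TYPES = (list, tuple, float, int, str, bytes, type(None))


def get_array_priority(obj, default=0.0):
    """Return obj.__array_priority__, or `default` when it is absent."""
    # An exact-type test, in the spirit of PyFloat_CheckExact: instances
    # of subclasses are not matched and fall through to the real lookup.
    if type(obj) in _FAST_PATH_TYPES:
        return default
    return getattr(obj, "__array_priority__", default)


class MyFloat(float):
    """Hypothetical float subclass that does define the attribute."""
    __array_priority__ = 5.0


if __name__ == "__main__":
    print(get_array_priority(1.5))
    print(get_array_priority(MyFloat(1.5)))
```

The exact-type check matters: a subclass such as MyFloat may define the
attribute, so only the plain builtin types skip the lookup.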
I also noticed (not as bad in my case) that calls to PyObject_GetBuffer
also resulted in Python errors being set, and thus in unnecessarily slow
code.
With this change, something like this,

    for i in xrange(1000000):
        if a[1] < 35.0:
            pass

went down from 0.8 seconds to 0.38 seconds.
A bogus test like this,

    for i in xrange(1000000):
        a = array([1., 2., 3.])

went down from 8.5 seconds to 2.5 seconds.
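
For anyone who wants to repeat these two measurements, a minimal harness
might look like the sketch below. It assumes Python 3 and a current
NumPy imported as np, so its timings are not comparable to the numbers
quoted above; the function names are made up for this example.

```python
import timeit

import numpy as np


def time_scalar_compare(number=1000000):
    # Times the expression "a[1] < 35.0" on a small float64 array.
    a = np.array([1.0, 2.0, 3.0])
    return timeit.timeit("a[1] < 35.0", globals={"a": a}, number=number)


def time_small_array_creation(number=1000000):
    # Times building a three-element array from a Python list.
    return timeit.timeit("array([1., 2., 3.])",
                         globals={"array": np.array}, number=number)


if __name__ == "__main__":
    print("a[1] < 35.0        :", time_scalar_compare())
    print("array([1., 2., 3.]):", time_small_array_creation())
```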
Altogether, these simple changes got me halfway to the speed I used to
get in Numeric, and I could not see any slowdown in any of my cases that
benefit from heavy array manipulation. I am out of ideas on how to
improve further, though.
A few questions:
- Is there any interest in my providing the exact details of the code
I changed?
- I managed to compile NumPy through setup.py, but I am not sure how to
force it to generate pdb files from my Visual Studio compiler. I need
the pdb files so that I can run my profiler on NumPy. Does anybody have
any experience with this? (Visual Studio)
- The core of my problems, I think, boils down to things like this

    s = a[0]

assigning a float64 to s as opposed to a native float.
Is there any way to hack the code so that it extracts a native float
instead? (Probably crazy talk, but I thought I'd ask :) ).
I'd prefer not to use s = a.item(0) because I would have to change too
much code and it is not even that much faster. For example,

    for i in xrange(1000000):
        if a.item(1) < 35.0:
            pass

takes 0.23 seconds (as opposed to 0.38 seconds with my suggested changes).
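
To make the float64 versus native float distinction concrete, here is a
small sketch assuming Python 3 and a current NumPy imported as np. It
does not change what a[0] returns; it only shows which types come back
from indexing, item(), float() and tolist(). The variable names are
made up for this example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

s_indexed = a[0]      # NumPy scalar of type numpy.float64
s_item = a.item(0)    # native Python float
s_cast = float(a[0])  # native Python float via explicit conversion
as_list = a.tolist()  # list of native Python floats

if __name__ == "__main__":
    for name, value in [("a[0]", s_indexed), ("a.item(0)", s_item),
                        ("float(a[0])", s_cast),
                        ("a.tolist()[0]", as_list[0])]:
        print(name, "->", type(value).__name__)
```

One option when results leave the vectorized part of the code is to
convert them once with tolist() or float(), so that later scalar math
works on native floats.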
I apologize again if this topic has already been discussed.
Regards,
Raul
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion