OK, I'm going to give the __new__ hack a try from
http://trac.cython.org/cython_trac/ticket/238
I don't really need to overload __new__, do I? So I shouldn't have to change
matrix.pxd.
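As I understand it, the core of that trick is that __new__ (tp_new) allocates the object without running the Python-level initializer, so a cached instance can be handed back without paying the setup cost. A minimal pure-Python sketch of just that behavior (the Matrix class here is made up for illustration; in Cython, __cinit__ still runs on every allocation, which is exactly why the ticket matters):

```python
class Matrix:
    def __init__(self, r, c):
        # Imagine this doing an expensive VSIPL block bind.
        self.r, self.c = r, c
        self.expensive = True

# Normal construction runs __init__ (the expensive path).
m1 = Matrix(4, 4)

# __new__ alone allocates the object but skips __init__ entirely,
# so fields could instead be patched in from a cache.
m2 = Matrix.__new__(Matrix)
m2.r, m2.c = 4, 4

print(hasattr(m1, "expensive"))  # True
print(hasattr(m2, "expensive"))  # False
```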

Just for completeness, here is a recent profile on a very fast machine.  (All
the other machines take 10 times as long, though they were running
Cython 0.11.3 rather than 0.12.)

         1155781 function calls in 11.978 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   107520    2.928    0.000    3.009    0.000 matrix.pyx:569(__cinit__)
     1024    2.223    0.002    2.632    0.003 matrix.pyx:1144(qr)
   107516    2.009    0.000    2.091    0.000 matrix.pyx:604(__dealloc__)
    40640    1.536    0.000    1.536    0.000 matrix.pyx:83(__dealloc__)
    40640    1.170    0.000    1.176    0.000 matrix.pyx:37(__cinit__)
     9920    0.337    0.000    0.394    0.000 matrix.pyx:1361(tomat)
     3584    0.193    0.000    4.395    0.001 muddect.py:92(sigdelay)
    19008    0.127    0.000    0.331    0.000 matrix.pyx:897(__mul__)
     1792    0.076    0.000    3.669    0.002 muddect.py:151(despread)
###  The calls to the hash table implemented via a Python dict are below,
###  almost two orders of magnitude cheaper than the matrix.pyx __cinit__
###  at the top of the profile. ####################
    90304    0.074    0.000    0.074    0.000 memcache.pyx:56(hashcget)
    91260    0.065    0.000    0.065    0.000 memcache.pyx:19(__cinit__)
     3584    0.056    0.000    0.285    0.000 matrix.pyx:1050(fft)
     3584    0.055    0.000    1.429    0.000 matrix.pyx:344(expj)
     3584    0.054    0.000    0.285    0.000 matrix.pyx:1071(ifft)
     7168    0.052    0.000    0.318    0.000 matrix.pyx:1324(r2cmat)
     7168    0.052    0.000    0.620    0.000 matrix.pyx:861(__and__)
  ...

To give some perspective, my Matlab code performing the same operation,
running under Octave on a single CPU, takes about the same amount of time
(13 seconds).  This Cython application is farmed out to a GPU with at least
240 processors on it, so I'd like to see well over an order of magnitude
improvement.  With a few exceptions, the lion's share of the Cython
processing is being sucked away on lousy memory allocations.  Even my QR
decomposition is getting killed by allocations and memory transfers, I
believe.

Here are the __cinit__() and __dealloc__() methods that dominate the time:
cdef class cmat:
    """ Complex matrix class; binds to both VSIPL and a numpy ndarray.
    The ndarray object is returned via getarr.  The array is bound
    back to VSIPL via admit. """

    def __cinit__(self, int r, int c):
        cdef vsip_cblock_f *myb
        cdef int istrue
        cdef np.ndarray[cfloat, ndim=2] carr
        cdef tuple t = (r, c)
        cdef bool isinhash

        self.parent = None  # assumes this is not a submatrix
        if r <= 0:
            self.vm = NULL
            return
        isinhash = hashcget(r, c, self)
        if isinhash:
            # already loaded from the cache
            return
        # create an ndarray complex-float buffer
        carr = np.empty(shape=(r, c), dtype=np.complex64, order='Fortran')
        self.arr = <object> carr

        # bind the ndarray to a VSIPL block
        myb = vsip_cblockbind_f(<vsip_scalar_f *> carr.data, NULL, r * c,
                                VSIP_MEM_NONE)
        if myb == NULL:
            raise MemoryError("Could not allocate complex matrix block")
        # admit the data so that VSIPL owns it
        istrue = vsip_cblockadmit_f(myb, 0)
        # create a view
        self.vm = vsip_cmbind_f(myb, 0, 1, r, r, c)
        if self.vm == NULL:
            raise MemoryError("Could not allocate complex matrix view")
        self.fftdata = NULL

    def __dealloc__(self):
        cdef hashcmat hc
        cdef bool isin = True
        if self.vm:
            if self.parent is None:
                # add this to the cache if it isn't already there
                hc = hashcmat(self)
                if not hc.isin:
                    # saved in the hash instead of being destroyed
                    return
                vsip_cmalldestroy_f(self.vm)
            else:
                vsip_cmdestroy_f(self.vm)
        if self.fftdata:
            vsip_fft_destroy_f(self.fftdata)
 

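Along the lines of Robert's pool suggestion quoted below, one approach would be to key a freelist on the (r, c) shape and recycle the whole bound object, rather than copying the device pointers into a separate cache class. A plain-Python sketch of just the pooling logic (the class and function names here are made up, and the vsip_* bind is stood in by a counter):

```python
# Hypothetical shape-keyed freelist: allocate() reuses a previously
# released matrix of the same shape instead of binding a new block.
binds = 0

class PooledMatrix:
    def __init__(self, r, c):
        global binds
        binds += 1          # stands in for the expensive vsip_cblockbind_f
        self.shape = (r, c)

pool = {}                   # (r, c) -> list of free matrices

def allocate(r, c):
    free = pool.get((r, c))
    if free:
        return free.pop()   # recycle: no new bind
    return PooledMatrix(r, c)

def release(m):
    pool.setdefault(m.shape, []).append(m)

a = allocate(8, 8)
release(a)
b = allocate(8, 8)          # same object comes back; no new bind
print(binds, a is b)        # 1 True
```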

The VSIPL vendor tells me that the only really expensive operation is the
cblockbind() within __cinit__().  However, the __dealloc__() routine is very
expensive as well, given the number of times it's being called.  It would be
nice if I could profile on a line-by-line basis; I'm not sure whether the
Python cProfile tool supports this.  I can't just chalk this result up to
the VSIPL code, since the hash routine is not giving me any performance gain
and actually seemed to make things worse.  (Though I probably need to do
some more debugging to see whether I have a lot of cache misses, or some bug
in my logic.)
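For what it's worth, cProfile itself only reports per-function totals, but Cython can emit cProfile hooks for the functions in a .pyx module via the profile directive, which at least gets meaningful per-function timings out of compiled code (this is a config fragment, not line-level profiling; true line-by-line timing would still need an external tool):

```python
# cython: profile=True
#
# Placed at the very top of matrix.pyx, this directive makes Cython
# generate cProfile trace hooks for the module's functions, so they
# appear in profiler output with their own timings.  It adds some
# per-call overhead, so it should be removed for production builds.
```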

For the life of me I could not figure out how to just put the matrix object
itself into my hash-indexed memory cache.  It seemed like my Python objects
were always being garbage collected once I hit the __dealloc__ routine (the
self.arr ndarray, for example).  Later I found out that the Cython class
gets stripped of its attributes if it's stored in a dictionary.  Only those
attributes written to the class's internal dictionary in the __init__()
method seem to get saved, as far as I can tell from my experiments.  Of
course, I'd actually like to avoid calling __init__().  I really didn't
intend to learn the internals of Python, or Cython for that matter, but I do
need to figure out how to optimize this code.
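Stefan's point in the quoted thread below does match what I can reproduce in plain Python: as long as the cache dictionary holds a reference, the finalizer never fires, so in principle the whole object can be parked there. A pure-Python illustration (the attribute-stripping I saw is a separate Cython issue, since cdef-class attributes live in C struct slots rather than an instance __dict__):

```python
deleted = []

class Cached:
    def __init__(self, key):
        self.key = key
    def __del__(self):
        deleted.append(self.key)

cache = {}

obj = Cached("8x8")
cache["8x8"] = obj      # the dict now keeps the object alive
del obj                 # drops one reference; __del__ does NOT run
print(deleted)          # []

cache.clear()           # last reference gone; now __del__ runs
print(deleted)          # ['8x8']
```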

Regards,
-Matt


Robert Bradshaw <rober...@...> writes:

> 
> On Jan 7, 2010, at 8:59 AM, Matthew wrote:
> 
> > Well I guess I outsmarted myself on this one.  After implementing my  
> > object hasher using dict,
> > my code slowed down by nearly a factor of 10 LOL.
> >
> > The hashing code itself shows up as a minimal contribution to the  
> > overall time in the profiler.
> > However it does require a little extra logic in the initializer and  
> > a copy of a few pointers.
> >
> > Basically I copy all the C device pointers to a new cython class  
> > which then gets saved in a dictionary upon the deallocation
> > of a matrix object.  If that object exists in the dictionary it is  
> > given to the allocator instead of allocating a new block.
> >
> > I'm wondering just what kind of overhead I'm up against with regards  
> > to the python initialization of my cython defined class?
> > This almost suggests that the overhead problem is actually in python/ 
> > cython and not just in my C code.
> 
> This could probably be discovered via some profiling. If this is the  
> case, take a look at
> 
> http://trac.cython.org/cython_trac/ticket/238 (the modern version of  
> the former PY_NEW trick). Maybe you're passing lots of (keyword?)  
> arguments?
> 
> >
> > -Matt
> >
> > Robert Bradshaw <rober...@...> writes:
> >
> >>
> >> On Jan 7, 2010, at 12:05 AM, Stefan Behnel wrote:
> >>
> >>> Matthew Bromberg, 06.01.2010 21:50:
> >>>> How does tuple or list compare speed wise with dict?
> >>>
> >>> Like apples and oranges, basically.
> >>
> >> If you're trying to index into it with an int, especially a c int,
> >> lists and tuples will be much faster.
> >>
> >>>> Ultimately I have to hash into my list using size information.
> >>>
> >>> Any specific reason why you /can't/ use a dict?
> >>
> >> Which will likely be just as fast, if not faster, than hashing into a
> >> list manually yourself.
> >>
> >>>> This also still does not address my confusion with regards to how  
> >>>> to
> >>> capture a python object before it gets destroyed.
> >>>
> >>> As long as there is a reference to it (e.g. in the hash table), it
> >>> won't
> >>> get deallocated. So: use a Python list for your hash table, stop
> >>> caring
> >>> about ref-counts and it will just work.
> >>
> >> +1. Ideally, you should never have to worry about reference counts
> >> when working with Cython at all.
> >>
> >> I'm still not quite sure exactly what you're trying to do, but if  
> >> it's
> >> creating and deleting thousands of these objects a second and that's
> >> killing you (the actual allocation/deallocation, not the
> >> initialization) then what you might want to do is something like
> >>
> >> http://hg.sagemath.org/sage-main/file/21efb0b3fc47/sage/rings/real_double.pyx#l2260
> >>
> >> which is a bit hackish and will probably need to be adapted to your
> >> specific case. If initializing is expensive, then you can probably
> >> keep around a pool of initialized pointers/buffers/whatever, and have
> >> the object creation just set/unset these fields (much cleaner).
> >>
> >> - Robert
> >>
> > _______________________________________________
> > Cython-dev mailing list
> > cython-...@...
> > http://codespeak.net/mailman/listinfo/cython-dev
> 
> 



_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev