Okay, my bad. I was only looping 40 times, so the initial call (which builds the kernel) was eating all the time. Iterating 4000 times through the loop gives much more reasonable per-call times -- @memoize_method is indeed working.
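For anyone following along, here is a minimal sketch of why the loop count matters (this is not PyCUDA's actual code; FakeKernel and the 0.05 s sleep are stand-ins for the real kernel build): a memoize_method-style decorator only pays the build cost on the first call, so 40 iterations are dominated by it while 4000 amortize it away.

import functools
import time

def memoize_method(method):
    # Per-instance result cache -- a rough sketch of what pytools.memoize_method does.
    @functools.wraps(method)
    def wrapper(self, *args):
        cache = self.__dict__.setdefault("_memoize_cache", {})
        key = (method.__name__, args)
        if key not in cache:
            cache[key] = method(self, *args)   # expensive path, taken once per key
        return cache[key]
    return wrapper

class FakeKernel:
    @memoize_method
    def generate(self, use_range):
        time.sleep(0.05)                       # stand-in for the real kernel compilation
        return "compiled kernel"

    def __call__(self):
        return self.generate(False)            # a dictionary lookup after the first call

for n in (40, 4000):
    k = FakeKernel()                           # fresh instance, so the first call rebuilds
    t0 = time.time()
    for _ in range(n):
        k()
    print("%5d calls: %.2e s per call" % (n, (time.time() - t0) / n))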
That being said, having a way to explicitly prepare the function before using it could still be helpful. One use case is keeping the one-time compilation out of profiling loops... :-) (A rough warm-up sketch follows the quoted message below.) Sorry for the red herring.

Michael.

On Jul 19, 2013, at 11:50 AM, Andreas Kloeckner <[email protected]> wrote:

> Michael McNeil Forbes <[email protected]> writes:
>> Here is the profile of the slow __call__. All the time is spent in
>> generate_stride_kernel_and_types:
>>
>> Line #  Hits    Time  Per Hit  % Time  Line Contents
>> ==============================================================
>>    192                                  def __call__(self, *args, **kwargs):
>>    193    78     145      1.9     0.1      vectors = []
>> ...
>>    204    78     104      1.3     0.1      func, arguments = self.generate_stride_kernel_and_types(
>>    205    78  199968   2563.7    97.3          range_ is not None or slice_ is not None)
>>    206
>>    207   156     354      2.3     0.2      for arg, arg_descr in zip(args, arguments):
>> ...
>>    241
>>    242    78    2780     35.6     1.4      func.prepared_async_call(grid, block, stream, *invocation_args)
>
> Now this is just confusing to me. generate_stride_kernel_and_types has a
> @memoize_method decorator, which should take care of caching the built
> kernel. Unless you're instantiating a new ElementwiseKernel for each
> call, generate_stride_kernel_and_types should only ever get called
> once. The default (cached) case should amount to one dictionary lookup,
> so I'm confused as to how that would eat up so much time. Can you
> perhaps create a small reproducer for this?
>
> Thanks,
> Andreas
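The warm-up sketch mentioned above: in the meantime, "preparing" the kernel can be approximated by calling it once on throwaway data before the timed region, so compilation (the generate_stride_kernel_and_types work) stays out of the measurement. The doubler kernel and the 4000-iteration loop here are made up for illustration, not anything from the thread:

import numpy as np
import pycuda.autoinit                      # creates a context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
from time import time

doubler = ElementwiseKernel(
    "float *y, float *x",
    "y[i] = 2.0f * x[i]",
    "doubler")

x = gpuarray.to_gpu(np.random.randn(1 << 20).astype(np.float32))
y = gpuarray.empty_like(x)

# Warm-up call: the kernel gets built and cached here,
# outside the region being timed or profiled.
doubler(y, x)

t0 = time()
for _ in range(4000):
    doubler(y, x)
pycuda.autoinit.context.synchronize()       # let the GPU finish before reading the clock
print("per call: %.2e s" % ((time() - t0) / 4000))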
