Michael McNeil Forbes <[email protected]> writes:
> Here is the profile of the slow __call__. All the time is spent in
> generate_stride_kernel_and_types:
>
> Line #  Hits    Time  Per Hit  % Time  Line Contents
> ==============================================================
>    192                                  def __call__(self, *args, **kwargs):
>    193     78     145      1.9     0.1      vectors = []
> ...
>    204     78     104      1.3     0.1      func, arguments = self.generate_stride_kernel_and_types(
>    205     78  199968   2563.7    97.3          range_ is not None or slice_ is not None)
>    206
>    207    156     354      2.3     0.2      for arg, arg_descr in zip(args, arguments):
> ...
>    241
>    242     78    2780     35.6     1.4      func.prepared_async_call(grid, block, stream, *invocation_args)
Now this is just confusing to me. generate_stride_kernel_and_types has a @memoize_method decorator, which should take care of caching the built kernel. Unless you're instantiating a new ElementwiseKernel for each call, generate_stride_kernel_and_types should only ever get called once. The default (cached) case should amount to one dictionary lookup, so I'm confused as to how that would eat up so much time.

Can you perhaps create a small reproducer for this?

Thanks,
Andreas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
