Hello everyone,
A few weeks back I brought forward an optimized einsum routine that attempts
to contract numpy arrays together in an optimal order. This can greatly
reduce the scaling and overall cost of the einsum expression, at the cost of
a few intermediate arrays. The current version (and more details) can be
found here:
https://github.com/dgasmith/opt_einsum
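
To make the idea concrete, here is a small illustration (plain numpy only;
this is not the opt_einsum interface, and the shapes are arbitrary) of how
evaluating a three-term expression pairwise trades one small intermediate
for much better scaling:

    import numpy as np

    n = 60
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    C = np.random.rand(n, n)

    # Single-shot einsum: effectively a four-index loop, O(n**4) work.
    naive = np.einsum('ij,jk,kl->il', A, B, C)

    # Pairwise evaluation: two matrix products, O(n**3) work each, at the
    # cost of holding one extra n-by-n intermediate.
    AB = np.einsum('ij,jk->ik', A, B)
    paired = np.einsum('ik,kl->il', AB, C)

    print(np.allclose(naive, paired))  # True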

I think this routine is close to a finalized version, but there are two 
problems which I would like the community to weigh in on:

Memory usage- 
Currently the routine only considers the maximum size of the intermediates
created rather than the cumulative memory usage. If we only use np.einsum it
is straightforward to calculate cumulative memory usage, as einsum does not
require any copies of the inputs to be made; however, if we attempt to use a
vendor BLAS the memory usage can become quite complex. For example, the
np.tensordot routine always forces a copy for ndarrays because it uses
ndarray.transpose(…).reshape(…). A more memory-conscious way to do this is to
first try ndarray.reshape(…).T, which does not copy the data; numpy can then
just pass a transpose flag to the vendor BLAS. The caveat here is that the
summed indices must be in the correct order; if not, a copy is required (a
small check illustrating the view-vs-copy difference is sketched after the
list below). The maximum array size is usually close to the total memory
overhead of the opt_einsum function, but the total can occasionally be 2-5
times this size. I see the following ways forward:
1. Ignore cumulative memory and stick with maximum array size as the limiting
   memory factor.
2. Implement logic to figure out if the input arrays need to be copied to use
   BLAS, compute the extra memory required, and add an extra dimension to the
   optimization algorithms (use BLAS or do not use BLAS at each step). Some of
   this is already done, but it may get fairly complex.
3. Build an in-place nd-transpose algorithm.
4. Cut out BLAS entirely, keeping in mind that a vendor BLAS can be orders of
   magnitude faster than a pure einsum implementation, especially if BLAS
   threading is used.
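
As a rough sketch of the view-vs-copy distinction above (plain numpy; the
shapes are made up, and np.may_share_memory is only used here to detect
whether a copy was made):

    import numpy as np

    X = np.random.rand(4, 5, 6)   # indices i, j, k

    # Summed indices (j, k) already trailing and in order: reshape to a
    # matrix and take .T.  Both steps return views, and the transpose can
    # be handed to the vendor BLAS as a flag instead of a data movement.
    mat = X.reshape(4, 30).T
    print(np.may_share_memory(X, mat))   # True  -> still a view

    # Summed indices out of order (k before j): after the transpose the
    # data is no longer contiguous along the merged axis, so the reshape
    # must copy.
    bad = X.transpose(0, 2, 1).reshape(4, 30)
    print(np.may_share_memory(X, bad))   # False -> a copy was made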

Path algorithms-
There are two path-finding algorithms: “optimal” (a brute-force algorithm
that scales like N!) and “opportunistic” (a greedy algorithm that scales like
N^3). The optimal path can take seconds to minutes to calculate for a 7-10
term expression, while the opportunistic path takes microseconds even for 20+
term expressions. The opportunistic algorithm works surprisingly well and
appears to obtain the correct scaling in all test cases that I can think of
(a toy sketch of the greedy idea follows the list of possible solutions
below). Both algorithms use the maximum array size as a sieve, which is
beneficial in several respects. The problem occurs when a much-needed
intermediate cannot be created due to this limit; on occasion, not making
this intermediate can cause slowdowns of orders of magnitude even for small
systems. This leads to certain (and sometimes unexpected) performance walls.
Possible solutions:
1. Warn the user if the ratio between the unlimited-memory solution and the
   limited-memory solution becomes large.
2. Do not worry about it.
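
For illustration only, here is a toy version of the greedy idea combined with
a memory sieve. It is not the actual opt_einsum implementation; the function
name, the size-only cost measure, and the index bookkeeping are all
simplifications:

    import numpy as np
    from itertools import combinations

    def greedy_path(inputs, output, sizes, memory_limit=None):
        """Pick pairwise contractions, always taking the pair whose
        intermediate is smallest and refusing pairs over memory_limit.

        inputs : list of index strings, e.g. ['ij', 'jk', 'kl']
        output : index string of the final result, e.g. 'il'
        sizes  : dict mapping each index letter to its dimension
        """
        inputs = list(inputs)
        path = []
        while len(inputs) > 1:
            best = None
            for i, j in combinations(range(len(inputs)), 2):
                # Indices that must survive this step: anything appearing
                # in the output or in an operand not being contracted.
                keep = set(output)
                for k, term in enumerate(inputs):
                    if k not in (i, j):
                        keep |= set(term)
                result = ''.join(sorted((set(inputs[i]) | set(inputs[j])) & keep))
                size = int(np.prod([sizes[idx] for idx in result])) if result else 1
                if memory_limit is not None and size > memory_limit:
                    continue  # the sieve: skip intermediates that are too big
                if best is None or size < best[0]:
                    best = (size, i, j, result)
            if best is None:
                raise RuntimeError("no contraction fits under the memory limit")
            _, i, j, result = best
            path.append((i, j))
            # Drop the two contracted operands and append the intermediate.
            inputs = [t for k, t in enumerate(inputs) if k not in (i, j)] + [result]
        return path

    # Chained matrix multiplication contracted pairwise, e.g. [(0, 1), (0, 1)].
    print(greedy_path(['ij', 'jk', 'kl'], 'il', {c: 10 for c in 'ijkl'}))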

Thank you for your time,
-Daniel Smith
