Thanks for taking the time to think about this; good work. Personally, I don't think a factor-of-5 memory overhead is much to sweat over. The most complex einsum I have ever needed in a production environment had 5 or 6 terms, and for what this anecdote is worth, speed was a far bigger concern to me than memory. (A few quick sketches illustrating the points follow below the quoted message.)
On Fri, Jan 16, 2015 at 12:30 AM, Daniel Smith <[email protected]> wrote:

> Hello everyone,
>
> I originally brought an optimized einsum routine forward a few weeks back
> that attempts to contract numpy arrays together in an optimal way. This can
> greatly reduce the scaling and overall cost of the einsum expression for
> the cost of a few intermediate arrays. The current version (and more
> details) can be found here:
> https://github.com/dgasmith/opt_einsum
>
> I think this routine is close to a finalized version, but there are two
> problems which I would like the community to weigh in on:
>
> Memory usage-
> Currently the routine only considers the maximum size of the intermediates
> created rather than cumulative memory usage. If we only use np.einsum, it
> is straightforward to calculate cumulative memory usage, as einsum does not
> require any copies of the inputs to be made; however, if we attempt to use
> a vendor BLAS, the memory usage can become quite complex. For example, the
> np.tensordot routine always forces a copy for ndarrays because it uses
> ndarray.transpose(…).reshape(…). A more memory-conscious way to do this is
> to first try ndarray.reshape(…).T, which does not copy the data, so numpy
> can just pass a transpose flag to the vendor BLAS. The caveat here is that
> the summed indices must be in the correct order- if not, a copy is
> required. The maximum array size is usually close to the total overhead of
> the opt_einsum function, but can occasionally be 2-5 times this size. I see
> the following ways forward:
>
> - Ignore cumulative memory and stick with maximum array size as the
> limiting memory factor.
> - Implement logic to figure out whether the input arrays need to be copied
> to use BLAS, compute the extra memory required, and add an extra dimension
> to the optimization algorithms (use BLAS or do not use BLAS at each step).
> Some of this is already done, but it may get fairly complex.
> - Build an in-place nd-transpose algorithm.
> - Cut out BLAS entirely. Keep in mind that a vendor BLAS can be orders of
> magnitude faster than a pure einsum implementation, especially if BLAS
> threading is used.
>
> Path algorithms-
> There are two algorithms: "optimal" (a brute-force algorithm that scales
> like N!) and "opportunistic" (a greedy algorithm that scales like N^3).
> The optimal path can take seconds to minutes to calculate for a 7-10 term
> expression, while the opportunistic path takes microseconds even for 20+
> term expressions. The opportunistic algorithm works surprisingly well and
> appears to obtain the correct scaling in all test cases that I can think
> of. Both algorithms use the maximum array size as a sieve- this is very
> beneficial in several respects. The problem occurs when a much-needed
> intermediate cannot be created due to this limit- on occasion, not forming
> this intermediate can cause slowdowns of orders of magnitude even for
> small systems. This leads to certain (and sometimes unexpected) performance
> walls. Possible solutions:
>
> - Warn the user if the ratio between an unlimited-memory solution and a
> limited-memory solution becomes large.
> - Do not worry about it.
>
> Thank you for your time,
> -Daniel Smith
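First, the scaling argument for forming intermediates, made concrete. For a simple
chain contraction, a single einsum call evaluates every index at once (O(n^4) here),
while contracting pairwise costs two O(n^3) matrix products at the price of one
n-by-n intermediate. A minimal plain-NumPy sketch (toy sizes, nothing from
opt_einsum itself):

    import numpy as np

    n = 60
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    C = np.random.rand(n, n)

    # Single-shot einsum: the naive contraction loops over i, j, k, l
    # simultaneously, so this chain scales like O(n^4).
    naive = np.einsum('ij,jk,kl->il', A, B, C)

    # Pairwise contraction: two O(n^3) products, plus one n*n intermediate.
    tmp = np.einsum('ij,jk->ik', A, B)      # the intermediate array
    paired = np.einsum('ik,kl->il', tmp, C)

    assert np.allclose(naive, paired)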
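Second, the copy-versus-view distinction Daniel describes can be checked directly.
This sketch demonstrates the memory behavior of the two idioms (not an equivalent
computation), using only public NumPy:

    import numpy as np

    a = np.arange(12.0).reshape(3, 4)   # C-contiguous

    # transpose-then-reshape: the transposed array's memory order no longer
    # matches the requested shape, so NumPy must copy the data.
    copied = a.transpose(1, 0).reshape(3, 4)
    print(np.shares_memory(a, copied))   # False -> data was copied

    # reshape-then-transpose: reshape of a contiguous array is a view, and
    # .T only swaps strides, so the original buffer is reused. A vendor
    # BLAS can consume this directly via its transpose flag.
    viewed = a.reshape(4, 3).T
    print(np.shares_memory(a, viewed))   # True -> no copy

As the quoted caveat says, the no-copy path only works when the summed indices
already sit in the right memory order; otherwise a copy is unavoidable.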
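Finally, a toy rendering of the greedy "opportunistic" idea: repeatedly contract
whichever pair of terms has the cheapest estimated FLOP cost. The helper names
(flop_cost, greedy_path) are mine, and the rule for which indices survive a
pairwise contraction is deliberately simplified; treat this as a sketch of the
shape of the approach, not opt_einsum's implementation:

    import numpy as np
    from itertools import combinations

    def flop_cost(idx_a, idx_b, dims):
        """Rough FLOP estimate for one pairwise contraction: the product
        of the sizes of all indices involved in the pair."""
        cost = 1
        for idx in set(idx_a) | set(idx_b):
            cost *= dims[idx]
        return cost

    def greedy_path(terms, dims):
        """Toy greedy path: at each step, contract the cheapest pair.
        `terms` is a list of index strings, e.g. ['ij', 'jk', 'kl']."""
        terms = list(terms)
        path = []
        while len(terms) > 1:
            i, j = min(combinations(range(len(terms)), 2),
                       key=lambda p: flop_cost(terms[p[0]], terms[p[1]], dims))
            # Simplification: keep only indices appearing in exactly one term.
            # (Real code must also keep indices used elsewhere or in the output.)
            kept = set(terms[i]) ^ set(terms[j])
            path.append((i, j))
            terms = [t for k, t in enumerate(terms) if k not in (i, j)]
            terms.append(''.join(sorted(kept)))
        return path

    print(greedy_path(['ij', 'jk', 'kl'], {d: 10 for d in 'ijkl'}))
    # -> [(0, 1), (0, 1)]: the two cheap pairwise products are picked first

On the chain example this finds the same pairwise order as the optimal path,
which matches the observation that the greedy algorithm usually obtains the
correct scaling; the memory sieve is then the extra constraint layered on top.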
