On Thu, Feb 21, 2013 at 1:00 PM, Pierre Haessig <[email protected]> wrote: > Hi everybody, > > (just coming from a discussion on the performance of Matplotlib's > (x)corr function which uses np.correlate) > > There have been already many discussions on how to compute > (cross-)correlations of time-series in Python (like > http://stackoverflow.com/questions/6991471/computing-cross-correlation-function). > The discussion is spread between the various stakeholders (just to name > some I've in mind : scipy, statsmodel, matplotlib, ...). > > There are basically 2 implementations : time-domain and frequency-domain > (using fft + multiplication). My discussion is only on time-domain > implementation. The key usecase which I feel is not well adressed today > is when computing the (cross)correlation of two long timeseries on only > a few lagpoints. > > For time-domain, one can either write its own implementation or rely on > numpy.correlate. The latter rely on the fast C implementation > _pyarray_correlate() in multiarraymodule.c > (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177) > > Now, the signature of this function is meant to accept one of three > *computation modes* ('valid', 'same', 'full'). Thoses modes make a lot > of sense when using this function in the context of convoluting two > signals (say an "input array" and a "impulse response array"). In the > context of computing the (cross)correlation of two long timeseries on > only a few lagpoints, those computation modes are unadapted and > potentially leads to huge computational and memory wastes > (for example : > https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes.py#L4332). > > > > For some time, I was thinking the solution was to write a dedicated > function in one of the stakeholder modules (but which ? statsmodel ?) > but now I came to think numpy is the best place to put a change and this > would quickly benefit to all stackeholders downstream. > > Indeed, I looked more carefully at the C _pyarray_correlate() function, > and I've come to the conclusion that these three computation modes are > an unnecessary restriction because the actual computation relies on a > triple for loop > (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177) > which boundaries are governed by two integers `n_left` and `n_right` > instead of the three modes. > > Maybe I've misunderstood the inner-working of this function, but I've > the feeling that the ability to set these two integers directly instead > of just the mode would open up the possibility to compute the > correlation on only a few lagpoints. > > I'm fully aware that changing the signature of such a core numpy is > out-of-question but I've got the feeling that a reasonably small some > code refactor might lift the longgoing problem of (cross-)correlation > computation. The Python np.correlate would require two additional > keywords -`n_left` and `n_right`)which would override the `mode` > keyword. Only the C function would need some more care to keep good > backward compatibility in case it's used externally. > > > > What do other people think ?
I think it's a good idea. I didn't look at the implementation details. In statsmodels we also use explicit loops (x[lag:] * y[:lag]).sum(0) at places where we expect that in most cases we only need a few correlations (<10 or <20). (the fft and correlate versions are mainly used where we expect or need a large number or the full number of valid correlations) Josef > > Best, > Pierre > > > _______________________________________________ > NumPy-Discussion mailing list > [email protected] > http://mail.scipy.org/mailman/listinfo/numpy-discussion > _______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
