Hi everybody, (just coming from a discussion on the performance of Matplotlib's (x)corr function which uses np.correlate)
There have been already many discussions on how to compute (cross-)correlations of time-series in Python (like http://stackoverflow.com/questions/6991471/computing-cross-correlation-function). The discussion is spread between the various stakeholders (just to name some I've in mind : scipy, statsmodel, matplotlib, ...). There are basically 2 implementations : time-domain and frequency-domain (using fft + multiplication). My discussion is only on time-domain implementation. The key usecase which I feel is not well adressed today is when computing the (cross)correlation of two long timeseries on only a few lagpoints. For time-domain, one can either write its own implementation or rely on numpy.correlate. The latter rely on the fast C implementation _pyarray_correlate() in multiarraymodule.c (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177) Now, the signature of this function is meant to accept one of three *computation modes* ('valid', 'same', 'full'). Thoses modes make a lot of sense when using this function in the context of convoluting two signals (say an "input array" and a "impulse response array"). In the context of computing the (cross)correlation of two long timeseries on only a few lagpoints, those computation modes are unadapted and potentially leads to huge computational and memory wastes (for example : https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes.py#L4332). For some time, I was thinking the solution was to write a dedicated function in one of the stakeholder modules (but which ? statsmodel ?) but now I came to think numpy is the best place to put a change and this would quickly benefit to all stackeholders downstream. Indeed, I looked more carefully at the C _pyarray_correlate() function, and I've come to the conclusion that these three computation modes are an unnecessary restriction because the actual computation relies on a triple for loop (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177) which boundaries are governed by two integers `n_left` and `n_right` instead of the three modes. Maybe I've misunderstood the inner-working of this function, but I've the feeling that the ability to set these two integers directly instead of just the mode would open up the possibility to compute the correlation on only a few lagpoints. I'm fully aware that changing the signature of such a core numpy is out-of-question but I've got the feeling that a reasonably small some code refactor might lift the longgoing problem of (cross-)correlation computation. The Python np.correlate would require two additional keywords -`n_left` and `n_right`)which would override the `mode` keyword. Only the C function would need some more care to keep good backward compatibility in case it's used externally. What do other people think ? Best, Pierre
signature.asc
Description: OpenPGP digital signature
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion