Proposed new feature or change:

`np.cov` centers its input by allocating `m - mean(m)`, a copy the size of `m`. 
When observations vastly outnumber variables, that copy dominates peak memory 
and wall time; the covariance matrix itself is tiny.
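For scale, a hedged sketch of the effect (numbers are illustrative; recent
NumPy reports its allocations to `tracemalloc`):

    import tracemalloc

    import numpy as np

    m = np.random.default_rng(0).standard_normal((4, 10_000_000))  # ~320 MB

    tracemalloc.start()
    np.cov(m)                          # the output is only 4x4...
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # ...but peak is roughly the size of `m` again, dominated by the
    # internal centered copy.
    print(f"peak temporary memory: {peak / 1e6:.0f} MB")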

I, and potentially scikit-learn (cc @ogrisel), would like an opt-in keyword
that uses the algebraically equivalent post-hoc formula:

    cov = (m @ m.conj().T) / fact - (N / fact) * np.outer(avg, avg.conj())
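A minimal equivalence check against today's `np.cov` (unweighted, default
`ddof=1`; `avg` is the per-variable mean):

    import numpy as np

    rng = np.random.default_rng(0)
    m = rng.standard_normal((5, 100_000))   # 5 variables, 1e5 observations

    N = m.shape[1]
    fact = N - 1                            # default ddof=1
    avg = m.mean(axis=1)

    # One Gram matrix plus a rank-1 correction; no centered copy of m.
    posthoc = (m @ m.conj().T) / fact - (N / fact) * np.outer(avg, avg.conj())

    assert np.allclose(posthoc, np.cov(m))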

It is already implemented in scikit-learn:

https://github.com/scikit-learn/scikit-learn/blob/94af1dff678e8a7a6fc627b85a5489980c063331/sklearn/decomposition/_pca.py#L604-L610

I think this optimization belongs in NumPy itself, and NumPy's history offers
several precedents for exactly this kind of change:

1. PR [#6403] (merged 2015) already halved `np.cov`'s memory use; the
   `m - mean` copy is what remains.
2. [#13199] tracks the same class of problem for `np.var` and `np.std`
   (still open).
3. [#6231] proposes Welford's algorithm; that is orthogonal, because the
   post-hoc formula pairs with BLAS `gemm`/`syrk` while Welford's streaming
   updates do not (see the sketch below).

[#1696], [#9631], and [#4694] cover the stability/memory tradeoff for
variance-like reductions.
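On the BLAS point in item 3: `m @ m.conj().T` is a symmetric rank-k update, so
a post-hoc implementation can even compute just one triangle. A sketch using
SciPy's BLAS wrappers (real `float64` only; whether NumPy would actually
dispatch to `syrk` is an open implementation detail):

    import numpy as np
    from scipy.linalg.blas import dsyrk

    rng = np.random.default_rng(0)
    m = rng.standard_normal((4, 100_000))

    # dsyrk fills one triangle of m @ m.T in a single BLAS call; Welford-style
    # streaming updates have no comparable kernel to lean on.
    lower = dsyrk(1.0, m, lower=1)
    gram = lower + np.tril(lower, -1).T     # symmetrize for comparison

    assert np.allclose(gram, m @ m.T)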

Scikit-learn already reimplements this locally, with strong benchmark results:
`PCA(svd_solver="covariance_eigh")` computes
`C = X.T @ X - N * np.outer(mean, mean); C /= N - 1` ([source][sklearn-pca])
to avoid the centered copy. [scikit-learn#33573][] ([review
comment][grisel-comment]) asks for this to move upstream.

The trade-off is that the post-hoc formula loses precision when
`|avg|^2 / var(m)` is large; Chan, Golub & LeVeque (1983) analyze exactly this
failure mode.
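A minimal demonstration of the failure mode, with a huge mean and unit
variance so that `|avg|^2 / var(m)` is around 1e16:

    import numpy as np

    rng = np.random.default_rng(0)
    x = 1e8 + rng.standard_normal(10_000)
    m = x[None, :]

    N = m.shape[1]
    avg = m.mean(axis=1)
    posthoc = (m @ m.T) / (N - 1) - (N / (N - 1)) * np.outer(avg, avg)

    print(np.cov(m))   # ~1.0: pre-centering stays accurate
    print(posthoc)     # nonsense (possibly negative): two ~1e16 terms
                       # cancel almost exactly in float64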

So the default stays safe. Callers who know their data is well-conditioned opt 
in.

## Proposal

    np.cov(m, ..., *, method="pre-center")   # or "post-hoc"

Weighted (`fweights`, `aweights`) and complex cases generalize without 
structural change. `corrcoef` could follow, but not in the first PR.
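To back up the claim about weights, a hedged sketch (the helper name is
hypothetical; `aweights`, `y`, `rowvar`, and dtype handling are omitted)
showing that `fweights` slots into the same two-term shape:

    import numpy as np

    def cov_posthoc(m, fweights=None, ddof=1):
        # Hypothetical helper, not NumPy API: a (weighted) Gram matrix
        # plus a rank-1 mean correction, as above.
        m = np.asarray(m)
        if fweights is None:
            w_sum = m.shape[1]
            avg = m.mean(axis=1)
            gram = m @ m.conj().T
        else:
            w = np.asarray(fweights, dtype=float)
            w_sum = w.sum()
            avg = (m @ w) / w_sum
            gram = (m * w) @ m.conj().T
        return (gram - w_sum * np.outer(avg, avg.conj())) / (w_sum - ddof)

    rng = np.random.default_rng(0)
    m = rng.standard_normal((3, 1_000))
    f = rng.integers(1, 5, size=1_000)
    assert np.allclose(cov_posthoc(m, f), np.cov(m, fweights=f))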

Chan, T. F., Golub, G. H., & LeVeque, R. J. (1983). Algorithms for computing 
the sample variance: Analysis and recommendations. The American Statistician, 
37(3), 242-247.

[#6403]: https://github.com/numpy/numpy/pull/6403
[#13199]: https://github.com/numpy/numpy/issues/13199
[#6231]: https://github.com/numpy/numpy/issues/6231
[#1696]: https://github.com/numpy/numpy/issues/1696
[#9631]: https://github.com/numpy/numpy/issues/9631
[#4694]: https://github.com/numpy/numpy/issues/4694
[sklearn-pca]: 
https://github.com/scikit-learn/scikit-learn/blob/94af1dff678e8a7a6fc627b85a5489980c063331/sklearn/decomposition/_pca.py#L604-L610
[scikit-learn#33573]: https://github.com/scikit-learn/scikit-learn/pull/33573
[grisel-comment]: 
https://github.com/scikit-learn/scikit-learn/pull/33573#discussion_r3101245545

cc @thomasjpfan, @ogrisel, @qbarthelemy, @betatim, @rgommers, @lucascolley,
and FYI @seberg and @eric-wieser.