Proposed new feature or change:
`np.cov` centers its input by allocating `m - mean(m)`, a copy the size of `m`.
When observations vastly outnumber variables, that copy dominates peak memory
and wall time; the covariance matrix itself is tiny.
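To put rough numbers on that, a quick illustration (sizes assume float64; the shapes here are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal((10, 1_000_000))  # 10 variables, 1e6 observations: ~80 MB

# np.cov centers internally, materializing m - mean as a second ~80 MB array
# at peak, while the returned covariance matrix is only 10 x 10 (800 bytes).
cov = np.cov(m)
print(cov.shape)  # (10, 10)
```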
I, and potentially scikit-learn (cc @ogrisel), would like to add an opt-in
keyword that uses the algebraically equivalent post-hoc formula:
```
cov = (m @ m.conj().T) / fact - (N / fact) * avg @ avg.conj().T
```
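For the unweighted case, a minimal sketch of what the opt-in path could
compute (variable names follow `np.cov`'s internals; nothing here is final
API):

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal((5, 10_000))  # 5 variables, 10k observations

N = m.shape[1]
fact = N - 1                          # ddof=1, np.cov's default
avg = m.mean(axis=1, keepdims=True)

# Gram matrix first (BLAS gemm/syrk territory), mean correction afterwards;
# no (5, 10_000)-sized centered copy is ever materialized.
cov = (m @ m.conj().T) / fact - (N / fact) * (avg @ avg.conj().T)

assert np.allclose(cov, np.cov(m))
```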
It is already implemented in scikit-learn:
https://github.com/scikit-learn/scikit-learn/blob/94af1dff678e8a7a6fc627b85a5489980c063331/sklearn/decomposition/_pca.py#L604-L610
I think this optimization belongs in NumPy itself, and there is plenty of
precedent for this kind of change. Prior art:
1. PR [#6403][] (merged in 2015) already halved `np.cov`'s memory footprint;
the `m - mean` copy is what's left.
2. [#13199][] tracks the same class of problem for `np.var` and `np.std` (still open).
3. [#6231][] proposes Welford's algorithm; that's orthogonal: the post-hoc
formula pairs with BLAS `gemm`/`syrk`, while Welford's streaming update does not.
4. [#1696][], [#9631][], and [#4694][] cover the stability/memory trade-off
for variance-like reductions.
Scikit-learn already reimplements this locally, with strong benchmark
results. `PCA(svd_solver="covariance_eigh")` computes the equivalent of
`C = (X.T @ X - N * np.outer(mean, mean)) / (N - 1)` ([source][sklearn-pca])
to avoid the copy.
[scikit-learn#33573][] ([review comment][grisel-comment]) asks for this to move
upstream.
The trade-off is that the post-hoc formula loses precision to catastrophic
cancellation when `|avg|^2 / var(m)` is large; Chan, Golub & LeVeque (1983),
cited below, analyze exactly this. So the default stays safe, and callers who
know their data is well-conditioned opt in.
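The cancellation is easy to demonstrate; a sketch (the shift of `1e8` is
deliberately extreme):

```python
import numpy as np

rng = np.random.default_rng(0)
m = (1e8 + rng.standard_normal(10_000))[None, :]  # huge mean, unit variance

N = m.shape[1]
fact = N - 1
avg = m.mean(axis=1, keepdims=True)
post_hoc = (m @ m.T) / fact - (N / fact) * (avg @ avg.T)

print(np.cov(m))   # ~1.0: the pre-centered path is accurate
print(post_hoc)    # can be badly off: ~1e16-scale terms cancel in float64
```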
## Proposal
```
np.cov(m, ..., *, method="pre-center")  # default; opt in with method="post-hoc"
```
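Hypothetical usage (the keyword name and its values are up for bikeshedding):

```python
cov_safe = np.cov(m)                     # unchanged default: pre-centered path
cov_lean = np.cov(m, method="post-hoc")  # opt-in: Gram-matrix path, no centered copy
```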
Weighted (`fweights`, `aweights`) and complex cases generalize without
structural change, as sketched below. `corrcoef` could follow, but not in the
first PR.
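To back up the weighted claim, a sketch under the weighting conventions
`np.cov` uses (`w = fweights * aweights`, `fact = v1 - ddof * sum(w *
aweights) / v1`); the helper name is made up for illustration:

```python
import numpy as np

def cov_post_hoc_weighted(m, fweights, aweights, ddof=1):
    """Hypothetical helper: weighted covariance via the post-hoc identity
    sum_i w_i (x_i - a)(x_i - a)^H = sum_i w_i x_i x_i^H - v1 * a a^H."""
    m = np.asarray(m)
    w = np.asarray(fweights) * np.asarray(aweights)
    v1 = w.sum()
    fact = v1 - ddof * (w * aweights).sum() / v1  # matches np.cov's normalization
    avg = (m * w).sum(axis=1, keepdims=True) / v1
    S = (m * w) @ m.conj().T                      # weighted Gram matrix
    return (S - v1 * (avg @ avg.conj().T)) / fact

rng = np.random.default_rng(0)
m = rng.standard_normal((3, 500))
fw = rng.integers(1, 5, size=500)
aw = rng.random(500)
assert np.allclose(cov_post_hoc_weighted(m, fw, aw),
                   np.cov(m, fweights=fw, aweights=aw))
```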
Chan, T. F., Golub, G. H., & LeVeque, R. J. (1983). Algorithms for computing
the sample variance: Analysis and recommendations. The American Statistician,
37(3), 242-247.
[#6403]: https://github.com/numpy/numpy/pull/6403
[#13199]: https://github.com/numpy/numpy/issues/13199
[#6231]: https://github.com/numpy/numpy/issues/6231
[#1696]: https://github.com/numpy/numpy/issues/1696
[#9631]: https://github.com/numpy/numpy/issues/9631
[#4694]: https://github.com/numpy/numpy/issues/4694
[sklearn-pca]: https://github.com/scikit-learn/scikit-learn/blob/94af1dff678e8a7a6fc627b85a5489980c063331/sklearn/decomposition/_pca.py#L604-L610
[scikit-learn#33573]: https://github.com/scikit-learn/scikit-learn/pull/33573
[grisel-comment]: https://github.com/scikit-learn/scikit-learn/pull/33573#discussion_r3101245545
cc @thomasjpfan, @ogrisel, @qbarthelemy, @betatim, @rgommers, @lucascolley;
FYI @seberg and @eric-wieser.