Hi,

On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> Hi,
>
> On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <n...@pobox.com> wrote:
>> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>> <charlesr.har...@gmail.com> wrote:
>>>
>>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.br...@gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
>>>> matrices that are numerically rank deficient.
>>>>
>>>> If I repeatedly make random matrices, then set the first column to be
>>>> equal to the sum of the second and third columns:
>>>>
>>>> def make_deficient():
>>>>     X = np.random.normal(size=(40, 10))
>>>>     deficient_X = X.copy()
>>>>     deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>>>     return deficient_X
>>>>
>>>> then the current numpy.linalg.matrix_rank algorithm returns full rank
>>>> (10) in about 8 percent of cases (see appended script).
>>>>
>>>> I think this is a tolerance problem.  The ``matrix_rank`` algorithm
>>>> does this by default:
>>>>
>>>> S = spl.svd(M, compute_uv=False)
>>>> tol = S.max() * np.finfo(S.dtype).eps
>>>> return np.sum(S > tol)
>>>>
>>>> I guess we'd want the lowest tolerance that nearly always or always
>>>> identifies numerically rank-deficient matrices.  I suppose one way of
>>>> looking at whether the tolerance is in the right range is to compare
>>>> the calculated tolerance (``tol``) to the minimum singular value
>>>> (``S.min()``), because S.min() in our case should be very small and
>>>> indicate the rank deficiency.  The mean value of tol / S.min() for the
>>>> current algorithm, across many iterations, is about 2.8.  We might
>>>> hope this value would be higher than 1, but not much higher, otherwise
>>>> we might be rejecting too many columns.
>>>>
>>>> Our current algorithm for tolerance is the same as the 2-norm of M *
>>>> eps.  We're citing Golub and Van Loan for this, but now I look at our
>>>> copy (p 261, last para), they seem to be suggesting using u * |M|,
>>>> where u (p 61, section 2.4.2) is eps / 2 (see [1]).  I think the Golub
>>>> and Van Loan suggestion corresponds to:
>>>>
>>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
>>>>
>>>> This tolerance gives full rank for these rank-deficient matrices in
>>>> about 39 percent of cases (tol / S.min() ratio of 1.7).
>>>>
>>>> We see on p 56 (section 2.3.2) that:
>>>>
>>>> m, n = M.shape
>>>> 1 / sqrt(n) * |M|_{inf} <= |M|_2
>>>>
>>>> so we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).  Setting:
>>>>
>>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
>>>>
>>>> gives about 0.5 percent error (tol / S.min() of 4.4).
>>>>
>>>> Using the Mathworks threshold [2]:
>>>>
>>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>>>
>>>> there are no false negatives (0 percent rank 10), but tol / S.min() is
>>>> around 110 - so conservative, in this case.
>>>>
>>>> So - summary - I'm worried our current threshold is too small, letting
>>>> through many rank-deficient matrices without detection.  I may have
>>>> misread Golub and Van Loan, but maybe we aren't doing what they
>>>> suggest.  Maybe what we could use is either the MATLAB threshold or
>>>> something like:
>>>>
>>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
>>>>
>>>> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n).  This
>>>> gives 0 percent misses and tol / S.min() of 8.7.
>>>>
>>>> What do y'all think?
>>>>
>>>> Best,
>>>>
>>>> Matthew
>>>>
>>>> [1] http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
>>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
>>>>
>>>> Output from script:
>>>>
>>>> Percent undetected current: 9.8, tol / S.min(): 2.762
>>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
>>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
>>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
>>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
>>>>
>>>> <script>
>>>> import numpy as np
>>>> import scipy.linalg as npl
>>>>
>>>> M = 40
>>>> N = 10
>>>>
>>>> def make_deficient():
>>>>     X = np.random.normal(size=(M, N))
>>>>     deficient_X = X.copy()
>>>>     if M > N:  # Make a column deficient
>>>>         deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>>>     else:  # Make a row deficient
>>>>         deficient_X[0] = deficient_X[1] + deficient_X[2]
>>>>     return deficient_X
>>>>
>>>> matrices = []
>>>> ranks = []
>>>> ranks_inf = []
>>>> ranks_ub_inf = []
>>>> ranks_ub2_inf = []
>>>> ranks_mlab = []
>>>> tols = np.zeros((1000, 6))
>>>> for i in range(1000):
>>>>     m = make_deficient()
>>>>     matrices.append(m)
>>>>     # The SVD tolerances
>>>>     S = npl.svd(m, compute_uv=False)
>>>>     S0 = S.max()
>>>>     # u in Golub and Van Loan == numpy eps / 2
>>>>     eps = np.finfo(m.dtype).eps
>>>>     u = eps / 2
>>>>     # Current numpy matrix_rank algorithm
>>>>     ranks.append(np.linalg.matrix_rank(m))
>>>>     # Which is the same as:
>>>>     tol_s0 = S0 * eps
>>>>     # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
>>>>     # Golub and Van Loan suggestion
>>>>     tol_inf = npl.norm(m, np.inf) * u
>>>>     ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
>>>>     # Upper bound of |X|_{inf}
>>>>     tol_ub_inf = tol_s0 * np.sqrt(N) / 2
>>>>     ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
>>>>     # Times 2 fudge
>>>>     tol_ub2_inf = tol_s0 * np.sqrt(N)
>>>>     ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
>>>>     # MATLAB algorithm
>>>>     tol_mlab = tol_s0 * max(m.shape)
>>>>     ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
>>>>     # Collect tols
>>>>     tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
>>>>
>>>> rel_tols = tols / tols[:, -1][:, None]
>>>>
>>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
>>>> max_rank = min(M, N)
>>>> for name, ranks, mrt in zip(
>>>>         ('current', 'inf norm', 'upper bound inf norm',
>>>>          'upper bound inf norm * 2', 'MATLAB'),
>>>>         (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
>>>>         rel_tols.mean(axis=0)[:5]):
>>>>     pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
>>>>     print fmt % (name, pcnt, mrt)
>>>> </script>
>>>
>>>
>>> The polynomial fitting uses eps times the largest array dimension for
>>> the relative condition number.  IIRC, that choice traces back to
>>> Numerical Recipes.
>
> Chuck - sorry - I didn't understand what you were saying, and now I
> think you were proposing the MATLAB algorithm.  I can't find that in
> Numerical Recipes - can you?  It would be helpful as a reference.
>
>> This is the same as Matlab, right?
>
> Yes, I believe so, i.e.:
>
> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>
> from my original email.
>
>> If the Matlab condition is the most conservative, then it seems like a
>> reasonable choice -- conservative is good so long as your false
>> positive rate doesn't become too high, and presumably Matlab has enough
>> user experience to know whether the false positive rate is too high.
>
> Are we agreeing to go for the Matlab algorithm?
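
If we do - just to be concrete, and as a sketch rather than a patch - I
believe that would make matrix_rank's default tolerance something like:

    S = np.linalg.svd(M, compute_uv=False)
    tol = S.max() * max(M.shape) * np.finfo(S.dtype).eps
    return np.sum(S > tol)

i.e. the current algorithm with the extra max(M.shape) factor.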
As extra data, current Numerical Recipes (2007, p 67) appears to prefer:

tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)

There's a discussion of algorithms in:

@article{konstantinides1988statistical,
  title={Statistical analysis of effective singular values in matrix rank determination},
  author={Konstantinides, K. and Yao, K.},
  journal={IEEE Transactions on Acoustics, Speech and Signal Processing},
  volume={36},
  number={5},
  pages={757--763},
  year={1988},
  publisher={IEEE}
}

Unfortunately, it's restricted access:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1585&tag=1

Cheers,

Matthew
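
In case it helps the comparison, an (untested) sketch of how the Numerical
Recipes tolerance above could be checked against the MATLAB one, on the same
kind of deficient matrices and shapes (M = 40, N = 10) as in the earlier
script - the loop counts how often each tolerance still reports full rank:

import numpy as np

M, N = 40, 10

def make_deficient():
    # First column is the sum of the second and third, so the matrix is
    # numerically rank deficient by construction
    X = np.random.normal(size=(M, N))
    X[:, 0] = X[:, 1] + X[:, 2]
    return X

n_iters = 1000
full_mlab = 0  # times the MATLAB tolerance misses the deficiency
full_nr = 0    # times the Numerical Recipes 2007 tolerance misses it
for i in range(n_iters):
    m = make_deficient()
    S = np.linalg.svd(m, compute_uv=False)
    eps = np.finfo(m.dtype).eps
    tol_mlab = S.max() * eps * max(m.shape)
    tol_nr = S.max() * eps / 2. * np.sqrt(m.shape[0] + m.shape[1] + 1.)
    full_mlab += np.linalg.matrix_rank(m, tol=tol_mlab) == min(M, N)
    full_nr += np.linalg.matrix_rank(m, tol=tol_nr) == min(M, N)

print('Percent undetected MATLAB: %.1f' % (full_mlab * 100. / n_iters))
print('Percent undetected NR 2007: %.1f' % (full_nr * 100. / n_iters))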