>On Sat, Jun 16, 2012 at 4:39 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> On Sat, Jun 16, 2012 at 9:03 PM, Matthew Brett <matthew.br...@gmail.com>
>> wrote:
>>> Hi,
>>>
>>> On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <n...@pobox.com> wrote:
>>>> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>>>> <charlesr.har...@gmail.com> wrote:
>>>>>
>>>>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.br...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
>>>>>> matrices that are numerically rank deficient.
>>>>>>
>>>>>> If I repeatedly make random matrices, then set the first column to be
>>>>>> equal to the sum of the second and third columns:
>>>>>>
>>>>>> def make_deficient():
>>>>>>     X = np.random.normal(size=(40, 10))
>>>>>>     deficient_X = X.copy()
>>>>>>     deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>>>>>     return deficient_X
>>>>>>
>>>>>> then the current numpy.linalg.matrix_rank algorithm returns full rank
>>>>>> (10) in about 8 percent of cases (see appended script).
>>>>>>
>>>>>> I think this is a tolerance problem. The ``matrix_rank`` algorithm
>>>>>> does this by default:
>>>>>>
>>>>>> S = np.linalg.svd(M, compute_uv=False)
>>>>>> tol = S.max() * np.finfo(S.dtype).eps
>>>>>> return np.sum(S > tol)
>>>>>>
>>>>>> I guess we'd want the lowest tolerance that nearly always, or always,
>>>>>> identifies numerically rank-deficient matrices. One way of checking
>>>>>> whether the tolerance is in the right range is to compare the
>>>>>> calculated tolerance (``tol``) to the minimum singular value
>>>>>> (``S.min()``), because S.min() in our case should be very small,
>>>>>> indicating the rank deficiency. The mean value of tol / S.min() for
>>>>>> the current algorithm, across many iterations, is about 2.8. We might
>>>>>> hope this value would be higher than 1, but not much higher, otherwise
>>>>>> we might be rejecting too many columns.
>>>>>>
>>>>>> Our current tolerance is the same as the 2-norm of M times eps. We
>>>>>> cite Golub and Van Loan for this, but now I look at our copy (p 261,
>>>>>> last para), they seem to be suggesting u * |M|_inf, where u = eps / 2
>>>>>> (p 61, section 2.4.2; see [1]). I think the Golub and Van Loan
>>>>>> suggestion corresponds to:
>>>>>>
>>>>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
>>>>>>
>>>>>> This tolerance gives full rank for these rank-deficient matrices in
>>>>>> about 39 percent of cases (tol / S.min() ratio of 1.7).
>>>>>>
>>>>>> We see on p 56 (section 2.3.2) that:
>>>>>>
>>>>>> m, n = M.shape
>>>>>> (1 / sqrt(n)) * |M|_inf <= |M|_2
>>>>>>
>>>>>> so we can get an upper bound on |M|_inf with |M|_2 * sqrt(n). Setting:
>>>>>>
>>>>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
>>>>>>
>>>>>> gives about 0.5 percent error (tol / S.min() of 4.4).
>>>>>>
>>>>>> Using the Mathworks threshold [2]:
>>>>>>
>>>>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>>>>>
>>>>>> there are no false negatives (0 percent rank 10), but tol / S.min()
>>>>>> is around 110 - so conservative, in this case.
>>>>>>
>>>>>> So - summary - I'm worried our current threshold is too small, letting
>>>>>> through many rank-deficient matrices without detection. I may have
>>>>>> misread Golub and Van Loan, but maybe we aren't doing what they
>>>>>> suggest.
>>>>>> Maybe what we could use is either the MATLAB threshold or something
>>>>>> like:
>>>>>>
>>>>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
>>>>>>
>>>>>> - that is, u times twice the upper bound for the inf norm,
>>>>>> u * 2 * |M|_2 * sqrt(n). This gives 0 percent misses and tol / S.min()
>>>>>> of 8.7.
>>>>>>
>>>>>> What do y'all think?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>> [1] http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
>>>>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
>>>>>>
>>>>>> Output from script:
>>>>>>
>>>>>> Percent undetected current: 9.8, tol / S.min(): 2.762
>>>>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
>>>>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
>>>>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
>>>>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
>>>>>>
>>>>>> <script>
>>>>>> import numpy as np
>>>>>> import scipy.linalg as spl
>>>>>>
>>>>>> M = 40
>>>>>> N = 10
>>>>>>
>>>>>> def make_deficient():
>>>>>>     X = np.random.normal(size=(M, N))
>>>>>>     deficient_X = X.copy()
>>>>>>     if M > N:  # Make a column deficient
>>>>>>         deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>>>>>     else:  # Make a row deficient
>>>>>>         deficient_X[0] = deficient_X[1] + deficient_X[2]
>>>>>>     return deficient_X
>>>>>>
>>>>>> matrices = []
>>>>>> ranks = []
>>>>>> ranks_inf = []
>>>>>> ranks_ub_inf = []
>>>>>> ranks_ub2_inf = []
>>>>>> ranks_mlab = []
>>>>>> tols = np.zeros((1000, 6))
>>>>>> for i in range(1000):
>>>>>>     m = make_deficient()
>>>>>>     matrices.append(m)
>>>>>>     # The SVD tolerances
>>>>>>     S = spl.svd(m, compute_uv=False)
>>>>>>     S0 = S.max()
>>>>>>     # u in Golub and Van Loan == numpy eps / 2
>>>>>>     eps = np.finfo(m.dtype).eps
>>>>>>     u = eps / 2
>>>>>>     # Current numpy matrix_rank algorithm
>>>>>>     ranks.append(np.linalg.matrix_rank(m))
>>>>>>     # Which is the same as:
>>>>>>     tol_s0 = S0 * eps
>>>>>>     # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
>>>>>>     # Golub and Van Loan suggestion
>>>>>>     tol_inf = spl.norm(m, np.inf) * u
>>>>>>     ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
>>>>>>     # Upper bound of |X|_{inf}
>>>>>>     tol_ub_inf = tol_s0 * np.sqrt(N) / 2
>>>>>>     ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
>>>>>>     # Times 2 fudge
>>>>>>     tol_ub2_inf = tol_s0 * np.sqrt(N)
>>>>>>     ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
>>>>>>     # MATLAB algorithm
>>>>>>     tol_mlab = tol_s0 * max(m.shape)
>>>>>>     ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
>>>>>>     # Collect tols
>>>>>>     tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
>>>>>>
>>>>>> rel_tols = tols / tols[:, -1][:, None]
>>>>>>
>>>>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
>>>>>> max_rank = min(M, N)
>>>>>> # Note: the loop variable is named rank_list to avoid shadowing
>>>>>> # the ranks list captured in the zip above.
>>>>>> for name, rank_list, mrt in zip(
>>>>>>         ('current', 'inf norm', 'upper bound inf norm',
>>>>>>          'upper bound inf norm * 2', 'MATLAB'),
>>>>>>         (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
>>>>>>         rel_tols.mean(axis=0)[:5]):
>>>>>>     pcnt = np.sum(np.array(rank_list) == max_rank) / 1000. * 100
>>>>>>     print fmt % (name, pcnt, mrt)
>>>>>> </script>
>>>>>
>>>>> The polynomial fitting uses eps times the largest array dimension for
>>>>> the relative condition number. IIRC, that choice traces back to
>>>>> Numerical Recipes.
>>>
>>> Chuck - sorry - I didn't understand what you were saying, and now I
>>> think you were proposing the MATLAB algorithm. I can't find that in
>>> Numerical Recipes - can you?
>>> It would be helpful as a reference.
>>>
>>>> This is the same as Matlab, right?
>>>
>>> Yes, I believe so, i.e.:
>>>
>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>>
>>> from my original email.
>>>
>>>> If the Matlab condition is the most conservative, then it seems like a
>>>> reasonable choice -- conservative is good so long as your false
>>>> positive rate doesn't become too high, and presumably Matlab has enough
>>>> user experience to know whether the false positive rate is too high.
>>>
>>> Are we agreeing to go for the Matlab algorithm?
>>>
>>> If so, how should this be managed? Just changing it may change the
>>> output of code using numpy >= 1.5.0, but then again, the threshold is
>>> probably incorrect.
>>>
>>> Fix and break, or FutureWarning with something like:
>>>
>>> def matrix_rank(M, tol=None):
>>>
>>> where ``tol`` can be a string like ``maxdim``?
>>
>> I dunno, I don't think we should do a big deprecation dance for every
>> bug fix. Is this a bug fix, so numpy will simply start producing more
>> accurate results on a given problem? I guess there isn't really a
>> right answer here (though claiming that [a, b, a+b] is full rank is
>> clearly broken, and the Matlab algorithm seems reasonable for
>> answering the specific question of whether a matrix is full rank), so
>> we'll have to hope some users speak up...
>
>I don't see a problem changing this as a bugfix.
>
>statsmodels still has, I think, the original scipy.stats.models version
>of rank, whose threshold (cond=1.0e-12) is still much higher than the
>eps-based ones for any non-huge float array.
>
>Josef
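
For concreteness, here is a minimal sketch of what the change under
discussion could look like - hypothetical code, not an actual numpy
patch. It follows the shape of the current matrix_rank, with only the
default tolerance swapped for the MATLAB-style one:

import numpy as np

def matrix_rank(M, tol=None):
    # Sketch only: rank of M from its singular values, using the
    # MATLAB-style default tolerance when tol is not given.
    M = np.asarray(M)
    if M.ndim > 2:
        raise TypeError('array should have 2 or fewer dimensions')
    if M.ndim < 2:
        # A vector has rank 1 unless it is all zeros.
        return int(not np.all(M == 0))
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        # MATLAB-style default: scale eps by the largest dimension,
        # instead of the current default of S.max() * eps alone.
        tol = S.max() * max(M.shape) * np.finfo(S.dtype).eps
    return np.sum(S > tol)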
+1 for making the default "matlab": it sounds like it would be the
least confusing. It also seems to me that a bug fix is probably the
right procedure. Last, I like having only the matlab default best
(options seem unnecessary).

Cheers,
JB
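
As a quick check on the numbers in the thread, a small demo comparing
the two thresholds on Matthew's make_deficient construction (results
vary by run; his script above reported ~9.8 percent undetected for the
current default and 0 percent for the MATLAB threshold):

import numpy as np

def make_deficient(m=40, n=10):
    # First column forced to be the sum of the second and third,
    # so the true rank is min(m, n) - 1.
    X = np.random.normal(size=(m, n))
    X[:, 0] = X[:, 1] + X[:, 2]
    return X

missed_current = 0
missed_matlab = 0
for _ in range(1000):
    A = make_deficient()
    S = np.linalg.svd(A, compute_uv=False)
    eps = np.finfo(S.dtype).eps
    # Reporting full rank (10) despite the construction is a miss.
    missed_current += np.sum(S > S.max() * eps) == 10
    missed_matlab += np.sum(S > S.max() * max(A.shape) * eps) == 10

print('Percent undetected, current default: %.1f' % (missed_current / 10.0))
print('Percent undetected, MATLAB default: %.1f' % (missed_matlab / 10.0))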