On Fri, Jun 15, 2012 at 6:39 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> Hi,
>
> On Thu, Jun 14, 2012 at 8:10 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
> >
> >
> > On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.br...@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
> >> matrices that are numerically rank deficient:
> >>
> >> If I repeatedly make random matrices, then set the first column to be
> >> equal to the sum of the second and third columns:
> >>
> >> def make_deficient():
> >>     X = np.random.normal(size=(40, 10))
> >>     deficient_X = X.copy()
> >>     deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>     return deficient_X
> >>
> >> then the current numpy.linalg.matrix_rank algorithm returns full rank
> >> (10) in about 8 percent of cases (see appended script).
> >>
> >> I think this is a tolerance problem. The ``matrix_rank`` algorithm
> >> does this by default:
> >>
> >> S = spl.svd(M, compute_uv=False)
> >> tol = S.max() * np.finfo(S.dtype).eps
> >> return np.sum(S > tol)
> >>
> >> I guess we'd want the lowest tolerance that nearly always or always
> >> identifies numerically rank-deficient matrices. I suppose one way of
> >> looking at whether the tolerance is in the right range is to compare
> >> the calculated tolerance (``tol``) to the minimum singular value
> >> (``S.min()``), because S.min() in our case should be very small and
> >> indicate the rank deficiency. The mean value of tol / S.min() for the
> >> current algorithm, across many iterations, is about 2.8. We might
> >> hope this value would be higher than 1, but not much higher, otherwise
> >> we might be rejecting too many columns.
> >>
> >> Our current algorithm for tolerance is the same as the 2-norm of M *
> >> eps. We're citing Golub and Van Loan for this, but now that I look at our
> >> copy (p 261, last para), they seem to be suggesting using u * |M|
> >> where u (p 61, section 2.4.2) = eps / 2 (see [1]). I think the Golub
> >> and Van Loan suggestion corresponds to:
> >>
> >> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
> >>
> >> This tolerance gives full rank for these rank-deficient matrices in
> >> about 39 percent of cases (tol / S.min() ratio of 1.7).
> >>
> >> We see on p 56 (section 2.3.2) that:
> >>
> >> m, n = M.shape
> >> 1 / sqrt(n) * |M|_{inf} <= |M|_2
> >>
> >> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n). Setting:
> >>
> >> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
> >>
> >> gives about 0.5 percent error (tol / S.min() of 4.4).
> >>
> >> Using the Mathworks threshold [2]:
> >>
> >> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>
> >> there are no false negatives (0 percent rank 10), but tol / S.min() is
> >> around 110 - so conservative, in this case.
> >>
> >> So - summary - I'm worried our current threshold is too small,
> >> letting through many rank-deficient matrices without detection. I may
> >> have misread Golub and Van Loan, but maybe we aren't doing what they
> >> suggest. Maybe what we could use is either the MATLAB threshold or
> >> something like:
> >>
> >> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
> >>
> >> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n). This
> >> gives 0 percent misses and tol / S.min() of 8.7.
> >>
> >> What do y'all think?
> >>
> >> Best,
> >>
> >> Matthew
> >>
> >> [1]
> >> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
> >> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
> >>
> >> Output from script:
> >>
> >> Percent undetected current: 9.8, tol / S.min(): 2.762
> >> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
> >> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
> >> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
> >> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
> >>
> >> <script>
> >> import numpy as np
> >> import scipy.linalg as npl
> >>
> >> M = 40
> >> N = 10
> >>
> >> def make_deficient():
> >>     X = np.random.normal(size=(M, N))
> >>     deficient_X = X.copy()
> >>     if M > N:  # Make a column deficient
> >>         deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>     else:  # Make a row deficient
> >>         deficient_X[0] = deficient_X[1] + deficient_X[2]
> >>     return deficient_X
> >>
> >> matrices = []
> >> ranks = []
> >> ranks_inf = []
> >> ranks_ub_inf = []
> >> ranks_ub2_inf = []
> >> ranks_mlab = []
> >> tols = np.zeros((1000, 6))
> >> for i in range(1000):
> >>     m = make_deficient()
> >>     matrices.append(m)
> >>     # The SVD tolerances
> >>     S = npl.svd(m, compute_uv=False)
> >>     S0 = S.max()
> >>     # u in Golub and Van Loan == numpy eps / 2
> >>     eps = np.finfo(m.dtype).eps
> >>     u = eps / 2
> >>     # Current numpy matrix_rank algorithm
> >>     ranks.append(np.linalg.matrix_rank(m))
> >>     # Which is the same as:
> >>     tol_s0 = S0 * eps
> >>     # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
> >>     # Golub and Van Loan suggestion
> >>     tol_inf = npl.norm(m, np.inf) * u
> >>     ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
> >>     # Upper bound of |X|_{inf}
> >>     tol_ub_inf = tol_s0 * np.sqrt(N) / 2
> >>     ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
> >>     # Times 2 fudge
> >>     tol_ub2_inf = tol_s0 * np.sqrt(N)
> >>     ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
> >>     # MATLAB algorithm
> >>     tol_mlab = tol_s0 * max(m.shape)
> >>     ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
> >>     # Collect tols
> >>     tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
> >>
> >> rel_tols = tols / tols[:, -1][:, None]
> >>
> >> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
> >> max_rank = min(M, N)
> >> for name, ranks, mrt in zip(
> >>         ('current', 'inf norm', 'upper bound inf norm',
> >>          'upper bound inf norm * 2', 'MATLAB'),
> >>         (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
> >>         rel_tols.mean(axis=0)[:5]):
> >>     pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
> >>     print fmt % (name, pcnt, mrt)
> >> </script>
> >
> >
> > The polynomial fitting uses eps times the largest array dimension for the
> > relative condition number. IIRC, that choice traces back to Numerical
> > Recipes.
>
> The problem with that as a general metric would be that larger values
> in the data would tend to lead to larger numerical error, but this
> won't be reflected in the tolerance. For example, running my script
> with your metric finds 0 errors for the normal distribution with var = 1,
> but 100 percent error for a variance of 1000.

It's a *relative* condition number. You need to multiply it by the
largest singular value to find the cutoff.
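As a minimal sketch of this distinction (an editor's aside, not code from the thread; the names are illustrative): used as an absolute tolerance, eps * max(m, n) ignores the scale of the data, which is what the "Chuck" row in the output quoted below measures, while multiplying it by the largest singular value gives a cutoff that scales with the matrix.

import numpy as np

# Rank-9-by-construction matrix, scaled up as in the "variance of 1000" case.
X = 1000 * np.random.normal(size=(40, 10))
X[:, 0] = X[:, 1] + X[:, 2]
S = np.linalg.svd(X, compute_uv=False)

rcond = np.finfo(X.dtype).eps * max(X.shape)  # eps * largest dimension

# Absolute reading: the smallest singular value (roughly 1e-12 here) sits far
# above eps * 40 ~ 9e-15, so the matrix usually comes out as full rank (10).
print(np.sum(S > rcond))
# Relative reading: the cutoff scales with S.max(), giving rank 9.
print(np.sum(S > rcond * S.max()))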
> Percent undetected current: 0.0, tol / S.min(): 3.545
> Percent undetected inf norm: 0.0, tol / S.min(): 2.139
> Percent undetected upper bound inf norm: 0.0, tol / S.min(): 5.605
> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 11.209
> Percent undetected MATLAB: 0.0, tol / S.min(): 141.785
> Percent undetected Chuck: 100.0, tol / S.min(): 0.013
>
> My worry is that I haven't sampled the space of all possible matrix
> sizes and scalings. First pass suggests that the values will be
> different on different platforms (at least, between a PPC 32-bit and an
> Intel 64-bit). I think the tolerance is wrong at the moment, and it
> looks like the Golub and Van Loan suggestion will not work as written.
> The MATLAB algorithm is some kind of standard and has been battle
> tested. If we are going to change, it seems tempting to use that.

Chuck
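For reference, here is a sketch of what adopting the MATLAB threshold as the default could look like. This is an editor's illustration under that assumption, not existing NumPy code, and matrix_rank_matlab is a hypothetical name.

import numpy as np

def matrix_rank_matlab(M, tol=None):
    """Rank with a MATLAB-style default tolerance: S.max() * eps * max(M.shape)."""
    M = np.asarray(M)
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        # The Mathworks threshold cited as [2] in the thread
        tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)
    return int(np.sum(S > tol))

# Check against the make_deficient() construction from the script above:
# the deficient 40 x 10 matrix should be reported as rank 9, not 10, and
# the answer should not change when the data are rescaled.
X = np.random.normal(size=(40, 10))
X[:, 0] = X[:, 1] + X[:, 2]
print(matrix_rank_matlab(X), matrix_rank_matlab(1000 * X))  # expected: 9 9

On the numbers quoted in the thread, this threshold is conservative (mean tol / S.min() around 110 for the test matrices) but produced no undetected rank-deficient cases.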
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion