On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.br...@gmail.com> wrote:

> Hi,
>
> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
> matrices that are numerically rank deficient:
>
> If I repeatedly make random matrices, then set the first column to be
> equal to the sum of the second and third columns:
>
> def make_deficient():
>     # Random 40 x 10 matrix whose first column is the sum of the
>     # second and third, so the numerical rank should be 9, not 10
>     X = np.random.normal(size=(40, 10))
>     deficient_X = X.copy()
>     deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>     return deficient_X
>
> then the current numpy.linalg.matrix_rank algorithm returns full rank
> (10) in roughly 10 percent of cases (9.8 percent in the run appended
> below; see script).
>
> I think this is a tolerance problem.  The ``matrix_rank`` algorithm
> does this by default:
>
> # spl is scipy.linalg; this is in effect the body of matrix_rank
> S = spl.svd(M, compute_uv=False)
> tol = S.max() * np.finfo(S.dtype).eps
> return np.sum(S > tol)
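>
> As a quick sanity check of that (a minimal sketch, using np.linalg.svd
> in place of spl.svd, and assuming the default just quoted):
>
> import numpy as np
>
> M = np.random.normal(size=(40, 10))
> S = np.linalg.svd(M, compute_uv=False)
> tol = S.max() * np.finfo(S.dtype).eps
> # The two numbers should agree while matrix_rank uses this default
> print np.linalg.matrix_rank(M), np.sum(S > tol)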
>
> I guess we'd want the lowest tolerance that nearly always, or always,
> identifies numerically rank-deficient matrices.  One way of checking
> whether the tolerance is in the right range is to compare the
> calculated tolerance (``tol``) to the minimum singular value
> (``S.min()``), because S.min() should be very small in our case,
> signalling the rank deficiency.  The mean value of tol / S.min() for
> the current algorithm, across many iterations, is about 2.8.  We would
> hope this value would be higher than 1, but not much higher; otherwise
> we might be rejecting too many columns.
>
> Our current tolerance is therefore eps times the 2-norm of M, since
> S.max() is the 2-norm.  We cite Golub and Van Loan for this, but now
> that I look at our copy (p. 261, last paragraph), they seem to be
> suggesting u * |M|_inf, where u = eps / 2 (p. 61, section 2.4.2; see
> [1]).  I think the Golub and Van Loan suggestion corresponds to:
>
> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
>
> This tolerance gives full rank for these rank-deficient matrices in
> about 39 percent of cases (a tol / S.min() ratio of about 1.7).
>
> We see on p. 56 (section 2.3.2) that:
>
> m, n = M.shape
> (1 / sqrt(n)) * |M|_inf <= |M|_2
>
> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).  Setting:
>
> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
>
> gives about 0.5 percent undetected (tol / S.min() of about 4.4).
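>
> A quick numerical check of that norm bound (a sketch I've added, not
> from Golub and Van Loan):
>
> import numpy as np
>
> M = np.random.normal(size=(40, 10))
> m, n = M.shape
> inf_norm = np.linalg.norm(M, np.inf)  # max absolute row sum
> two_norm = np.linalg.norm(M, 2)       # largest singular value
> assert inf_norm <= np.sqrt(n) * two_norm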
>
> Using the Mathworks threshold [2]:
>
> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>
> There are no undetected rank-deficient matrices (0 percent at rank
> 10), but tol / S.min() is around 110 - so this is conservative, at
> least in this case.
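>
> For reference, a minimal sketch of a rank function using the MATLAB
> threshold (the function name is mine, not MATLAB's or numpy's):
>
> import numpy as np
>
> def matlab_style_rank(M):
>     # MATLAB: tolerance = eps * max(m, n) * largest singular value
>     S = np.linalg.svd(M, compute_uv=False)
>     tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)
>     return np.sum(S > tol)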
>
> So - summary - I'm worried that our current threshold is too small,
> letting through many rank-deficient matrices without detection.  I may
> have misread Golub and Van Loan, but maybe we aren't doing what they
> suggest.  Maybe what we could use is either the MATLAB threshold or
> something like:
>
> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
>
> - that is, u times twice the upper bound for the inf norm, i.e. eps *
> |M|_2 * sqrt(n).  This gives 0 percent misses and a tol / S.min() of
> 8.7.
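>
> In the meantime, any of these can be passed through matrix_rank's
> existing ``tol`` keyword; a sketch with the sqrt(n) suggestion above:
>
> import numpy as np
>
> M = make_deficient()  # from the script below
> S = np.linalg.svd(M, compute_uv=False)
> n = M.shape[1]
> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
> print np.linalg.matrix_rank(M, tol=tol)  # expect 9, not 10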
>
> What do y'all think?
>
> Best,
>
> Matthew
>
> [1]
> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
>
> Output from script:
>
> Percent undetected current: 9.8, tol / S.min(): 2.762
> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
>
> <script>
> import numpy as np
> import scipy.linalg as spl  # scipy (not numpy) linalg, for svd and norm
>
> M = 40
> N = 10
>
> def make_deficient():
>     X = np.random.normal(size=(M, N))
>     deficient_X = X.copy()
>     if M > N: # Make a column deficient
>         deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>     else: # Make a row deficient
>         deficient_X[0] = deficient_X[1] + deficient_X[2]
>     return deficient_X
>
> matrices = []
> ranks = []
> ranks_inf = []
> ranks_ub_inf = []
> ranks_ub2_inf = []
> ranks_mlab = []
> tols = np.zeros((1000, 6))
> for i in range(1000):
>     m = make_deficient()
>     matrices.append(m)
>     # The SVD tolerances
>     S = spl.svd(m, compute_uv=False)
>     S0 = S.max()
>     # u in Golub and Van Loan == numpy eps / 2
>     eps = np.finfo(m.dtype).eps
>     u = eps / 2
>     # Current numpy matrix_rank algorithm
>     ranks.append(np.linalg.matrix_rank(m))
>     # Which is the same as:
>     tol_s0 = S0 * eps
>     # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
>     # Golub and Van Loan suggestion
>     tol_inf = spl.norm(m, np.inf) * u
>     ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
>     # Upper bound of |M|_{inf}
>     tol_ub_inf = tol_s0 * np.sqrt(N) / 2
>     ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
>     # Times 2 fudge
>     tol_ub2_inf = tol_s0 * np.sqrt(N)
>     ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
>     # MATLAB algorithm
>     tol_mlab = tol_s0 * max(m.shape)
>     ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
>     # Collect tols
>     tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
>
> rel_tols = tols / tols[:, -1][:, None]
>
> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
> max_rank = min(M, N)
> for name, rank_list, mrt in zip(
>         ('current', 'inf norm', 'upper bound inf norm',
>          'upper bound inf norm * 2', 'MATLAB'),
>         (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
>         rel_tols.mean(axis=0)[:5]):
>     pcnt = np.sum(np.array(rank_list) == max_rank) / 1000. * 100
>     print fmt % (name, pcnt, mrt)
> </script>
>

The polynomial fitting uses eps times the largest array dimension for
the relative condition number. IIRC, that choice traces back to
Numerical Recipes.
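
For concreteness, here is that default made explicit via polyfit's
``rcond`` argument (a sketch; passing rcond here just spells out what
I understand the default to be):

import numpy as np

x = np.linspace(0, 1, 50)
y = 2 * x + 1
# eps times the number of points - the largest dimension of the
# Vandermonde design matrix - as the relative condition number
rcond = len(x) * np.finfo(x.dtype).eps
coeffs = np.polyfit(x, y, 1, rcond=rcond)
print coeffs  # approximately [ 2.  1.]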

Chuck