On Tue, Jun 26, 2012 at 3:42 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> Hi,
>
> On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> > Hi,
> >
> > On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris
> > <charlesr.har...@gmail.com> wrote:
> >>
> >> On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> >>> > Hi,
> >>> >
> >>> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <n...@pobox.com> wrote:
> >>> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
> >>> >> <charlesr.har...@gmail.com> wrote:
> >>> >>>
> >>> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett
> >>> >>> <matthew.br...@gmail.com> wrote:
> >>> >>>>
> >>> >>>> Hi,
> >>> >>>>
> >>> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
> >>> >>>> matrices that are numerically rank deficient.
> >>> >>>>
> >>> >>>> If I repeatedly make random matrices, then set the first column to be
> >>> >>>> equal to the sum of the second and third columns:
> >>> >>>>
> >>> >>>> def make_deficient():
> >>> >>>>     X = np.random.normal(size=(40, 10))
> >>> >>>>     deficient_X = X.copy()
> >>> >>>>     deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>> >>>>     return deficient_X
> >>> >>>>
> >>> >>>> then the current numpy.linalg.matrix_rank algorithm returns full rank
> >>> >>>> (10) in about 8 percent of cases (see appended script).
> >>> >>>>
> >>> >>>> I think this is a tolerance problem. The ``matrix_rank`` algorithm
> >>> >>>> does this by default:
> >>> >>>>
> >>> >>>> S = spl.svd(M, compute_uv=False)
> >>> >>>> tol = S.max() * np.finfo(S.dtype).eps
> >>> >>>> return np.sum(S > tol)
> >>> >>>>
> >>> >>>> I guess we'd want the lowest tolerance that nearly always or always
> >>> >>>> identifies numerically rank deficient matrices. I suppose one way of
> >>> >>>> looking at whether the tolerance is in the right range is to compare
> >>> >>>> the calculated tolerance (``tol``) to the minimum singular value
> >>> >>>> (``S.min()``), because S.min() in our case should be very small and
> >>> >>>> indicate the rank deficiency. The mean value of tol / S.min() for the
> >>> >>>> current algorithm, across many iterations, is about 2.8. We might
> >>> >>>> hope this value would be higher than 1, but not much higher,
> >>> >>>> otherwise we might be rejecting too many columns.
> >>> >>>>
> >>> >>>> Our current algorithm for the tolerance is the same as the 2-norm of
> >>> >>>> M * eps. We're citing Golub and Van Loan for this, but now I look at
> >>> >>>> our copy (p 261, last para), they seem to be suggesting using u * |M|,
> >>> >>>> where u = eps / 2 (p 61, section 2.4.2; see [1]). I think the Golub
> >>> >>>> and Van Loan suggestion corresponds to:
> >>> >>>>
> >>> >>>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
> >>> >>>>
> >>> >>>> This tolerance gives full rank for these rank-deficient matrices in
> >>> >>>> about 39 percent of cases (tol / S.min() ratio of 1.7).
> >>> >>>>
> >>> >>>> We see on p 56 (section 2.3.2) that:
> >>> >>>>
> >>> >>>> m, n = M.shape
> >>> >>>> 1 / sqrt(n) * |M|_{inf} <= |M|_2
> >>> >>>>
> >>> >>>> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).
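As a quick check on that norm bound, a minimal sketch along these lines (an arbitrary random matrix of the same shape as the examples above; the seed and variable names are only illustrative) should confirm |M|_{inf} <= sqrt(n) * |M|_2:

import numpy as np

rng = np.random.RandomState(0)        # arbitrary seed, just for reproducibility
M = rng.normal(size=(40, 10))         # same shape as the example matrices
m, n = M.shape

inf_norm = np.linalg.norm(M, np.inf)  # maximum absolute row sum
two_norm = np.linalg.norm(M, 2)       # largest singular value

# Golub and Van Loan, section 2.3.2: |M|_{inf} <= sqrt(n) * |M|_2
assert inf_norm <= np.sqrt(n) * two_norm
print('inf norm %.3f <= sqrt(n) * 2-norm %.3f' % (inf_norm, np.sqrt(n) * two_norm))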
> >>> >>>> Setting:
> >>> >>>>
> >>> >>>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
> >>> >>>>
> >>> >>>> gives about 0.5 percent error (tol / S.min() of 4.4).
> >>> >>>>
> >>> >>>> Using the Mathworks threshold [2]:
> >>> >>>>
> >>> >>>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>> >>>>
> >>> >>>> There are no false negatives (0 percent rank 10), but tol / S.min()
> >>> >>>> is around 110 - so conservative, in this case.
> >>> >>>>
> >>> >>>> So - summary - I'm worried our current threshold is too small,
> >>> >>>> letting through many rank-deficient matrices without detection. I
> >>> >>>> may have misread Golub and Van Loan, but maybe we aren't doing what
> >>> >>>> they suggest. Maybe what we could use is either the MATLAB threshold
> >>> >>>> or something like:
> >>> >>>>
> >>> >>>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
> >>> >>>>
> >>> >>>> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n).
> >>> >>>> This gives 0 percent misses and tol / S.min() of 8.7.
> >>> >>>>
> >>> >>>> What do y'all think?
> >>> >>>>
> >>> >>>> Best,
> >>> >>>>
> >>> >>>> Matthew
> >>> >>>>
> >>> >>>> [1] http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
> >>> >>>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
> >>> >>>>
> >>> >>>> Output from script:
> >>> >>>>
> >>> >>>> Percent undetected current: 9.8, tol / S.min(): 2.762
> >>> >>>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
> >>> >>>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
> >>> >>>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
> >>> >>>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
> >>> >>>>
> >>> >>>> <script>
> >>> >>>> import numpy as np
> >>> >>>> import scipy.linalg as npl
> >>> >>>>
> >>> >>>> M = 40
> >>> >>>> N = 10
> >>> >>>>
> >>> >>>> def make_deficient():
> >>> >>>>     X = np.random.normal(size=(M, N))
> >>> >>>>     deficient_X = X.copy()
> >>> >>>>     if M > N:  # Make a column deficient
> >>> >>>>         deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >>> >>>>     else:  # Make a row deficient
> >>> >>>>         deficient_X[0] = deficient_X[1] + deficient_X[2]
> >>> >>>>     return deficient_X
> >>> >>>>
> >>> >>>> matrices = []
> >>> >>>> ranks = []
> >>> >>>> ranks_inf = []
> >>> >>>> ranks_ub_inf = []
> >>> >>>> ranks_ub2_inf = []
> >>> >>>> ranks_mlab = []
> >>> >>>> tols = np.zeros((1000, 6))
> >>> >>>> for i in range(1000):
> >>> >>>>     m = make_deficient()
> >>> >>>>     matrices.append(m)
> >>> >>>>     # The SVD tolerances
> >>> >>>>     S = npl.svd(m, compute_uv=False)
> >>> >>>>     S0 = S.max()
> >>> >>>>     # u in Golub and Van Loan == numpy eps / 2
> >>> >>>>     eps = np.finfo(m.dtype).eps
> >>> >>>>     u = eps / 2
> >>> >>>>     # Current numpy matrix_rank algorithm
> >>> >>>>     ranks.append(np.linalg.matrix_rank(m))
> >>> >>>>     # Which is the same as:
> >>> >>>>     tol_s0 = S0 * eps
> >>> >>>>     # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
> >>> >>>>     # Golub and Van Loan suggestion
> >>> >>>>     tol_inf = npl.norm(m, np.inf) * u
> >>> >>>>     ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
> >>> >>>>     # Upper bound of |X|_{inf}
> >>> >>>>     tol_ub_inf = tol_s0 * np.sqrt(N) / 2
> >>> >>>>     ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
> >>> >>>>     # Times 2 fudge
> >>> >>>>     tol_ub2_inf = tol_s0 * np.sqrt(N)
> >>> >>>>     ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
> >>> >>>>     # MATLAB algorithm
> >>> >>>>     tol_mlab = tol_s0 * max(m.shape)
> >>> >>>>     ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
> >>> >>>>     # Collect tols
> >>> >>>>     tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
> >>> >>>>
> >>> >>>> rel_tols = tols / tols[:, -1][:, None]
> >>> >>>>
> >>> >>>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
> >>> >>>> max_rank = min(M, N)
> >>> >>>> for name, ranks, mrt in zip(
> >>> >>>>         ('current', 'inf norm', 'upper bound inf norm',
> >>> >>>>          'upper bound inf norm * 2', 'MATLAB'),
> >>> >>>>         (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
> >>> >>>>         rel_tols.mean(axis=0)[:5]):
> >>> >>>>     pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
> >>> >>>>     print fmt % (name, pcnt, mrt)
> >>> >>>> </script>
> >>> >>>
> >>> >>> The polynomial fitting uses eps times the largest array dimension for
> >>> >>> the relative condition number. IIRC, that choice traces back to
> >>> >>> Numerical Recipes.
> >>> >
> >>> > Chuck - sorry - I didn't understand what you were saying, and now I
> >>> > think you were proposing the MATLAB algorithm. I can't find that in
> >>> > Numerical Recipes - can you? It would be helpful as a reference.
> >>> >
> >>> >> This is the same as Matlab, right?
> >>> >
> >>> > Yes, I believe so, i.e.:
> >>> >
> >>> > tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>> >
> >>> > from my original email.
> >>> >
> >>> >> If the Matlab condition is the most conservative, then it seems like a
> >>> >> reasonable choice -- conservative is good so long as your false
> >>> >> positive rate doesn't become too high, and presumably Matlab has
> >>> >> enough user experience to know whether the false positive rate is too
> >>> >> high.
> >>> >
> >>> > Are we agreeing to go for the Matlab algorithm?
> >>>
> >>> As extra data, current Numerical Recipes (2007, p 67) appears to prefer:
> >>>
> >>> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
> >>
> >> That's interesting, as something like that with a square root was my
> >> first choice for the least squares, but then someone mentioned the NR
> >> choice. That was all on the mailing list several years back when I was
> >> fixing up the polynomial fitting routine. The NR reference is on page
> >> 517 of the 1986 edition (FORTRAN), which might be hard to come by these
> >> days ;)
> >
> > Thanks for tracking that down, it's very helpful.
> >
> > For those of you not near a huge University library or your own
> > private copy, p 517 says:
> >
> > "A plausible answer to the question "how small is small" is to edit
> > in this fashion all singular values whose ratio to the largest
> > singular value is less than N times the machine precision \epsilon.
> > (You might argue for root N, or a constant, instead of N as the
> > multiple; that starts getting into hardware-dependent questions.)"
> >
> > Earlier (p 510) we see the General Linear Least Squares problem being
> > set up with A an (N x M) matrix, where N >= M.
> >
> > The 2007 edition replaces the "(You might argue ...)" text with (p 795):
> >
> > "(This is a more conservative recommendation than the default in
> > section 2.6, which scales as N^{1/2})"
> >
> > and this in turn refers to the threshold:
> >
> > tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
> >
> > (p 67) - which is justified (p 71) as "... a default value based on
> > expected roundoff error".
> >
> > I could not at first glance see any other justification for this
> > threshold in the text.
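To make the candidates easier to compare side by side, here is a rough sketch (an illustration only; the helper function and labels are assumptions, not code from the thread) that evaluates each proposed threshold on a single deficient example. Results vary from run to run, as the percentages quoted earlier show:

import numpy as np

def make_deficient(m=40, n=10):
    # Random matrix whose first column is the sum of the next two,
    # as in the example at the top of the thread.
    X = np.random.normal(size=(m, n))
    X[:, 0] = X[:, 1] + X[:, 2]
    return X

M = make_deficient()
m, n = M.shape
S = np.linalg.svd(M, compute_uv=False)
eps = np.finfo(M.dtype).eps

# Candidate thresholds from the discussion; the names are just labels.
thresholds = [
    ('current numpy', S.max() * eps),
    ('Golub & Van Loan', np.linalg.norm(M, np.inf) * eps / 2),
    ('MATLAB / NR 1986', S.max() * eps * max(m, n)),
    ('NR 2007', S.max() * eps / 2 * np.sqrt(m + n + 1)),
]
for name, tol in thresholds:
    print('%-18s tol=%.3e estimated rank=%d' % (name, tol, np.sum(S > tol)))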
> > So, how about something like:
> >
> > def matrix_rank(M, tol='maxdim'):
> >     ...
> >
> >     tol : {'maxdim', 'nr-roundoff'} or float
> >         If str, gives the threshold strategy for the tolerance relative
> >         to the maximum singular value, explained below. If float, gives
> >         the absolute tolerance below which singular values are assumed
> >         to be zero. For the threshold strategies, call the maximum
> >         singular value ``S.max()`` and the floating point epsilon for
> >         the working precision data type ``eps``. The default strategy
> >         is 'maxdim', corresponding to ``tol = S.max() * eps *
> >         max(M.shape)``. This is the MATLAB default; see also Numerical
> >         Recipes 2007. The other option is 'nr-roundoff' (also from
> >         Numerical Recipes 2007), corresponding to ``tol = S.max() * eps
> >         / 2 * np.sqrt(M.shape[0] + M.shape[1] + 1)``.
> >
> > ?
>
> Thinking about this more, I'm tempted to just go for the matlab / old
> NR solution, that is:
>
> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>
> and not offer any other options. My reasoning is that the matrix_rank
> code is basically trivial; it is a convenience function. If someone
> wants a better matrix rank it's reasonable to copy these few lines
> into a new function, so providing multiple options for the tolerance
> seems like overkill.
>
> Then the question is whether to use NR-2007:
>
> tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
>
> or the matlab default above. I'm leaning towards the matlab version
> because we can be more or less sure that it does not miss actual
> numerical rank deficiency, and it might be what people are expecting
> if they shift from matlab or compare to matlab. The NR-2007 threshold
> might well be closer to an accurate threshold for numerical rank
> deficiency, but I haven't tested it with all variants and all
> platforms, and the reduced false positives seem a minor gain compared
> to the slight increase in the risk of false negatives and the
> difference from matlab.
>
> What do y'all think?

I'm fine with that, and agree that it is likely to lead to fewer folks
wondering why Matlab and numpy are different. A good explanation in the
function documentation would be useful.

Chuck
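For concreteness, a minimal sketch of what matrix_rank might look like with the MATLAB-style default dropped in (an illustration of the behaviour being agreed on, not the actual numpy implementation):

import numpy as np

def matrix_rank(M, tol=None):
    # Sketch only: SVD-based rank with the MATLAB-style default tolerance
    # tol = S.max() * max(M.shape) * eps discussed in this thread.
    M = np.asarray(M)
    if M.ndim > 2:
        raise TypeError('array should have 2 or fewer dimensions')
    if M.ndim < 2:
        # A scalar or vector has rank 1 unless it is all zero.
        return int(not np.all(M == 0))
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        tol = S.max() * max(M.shape) * np.finfo(S.dtype).eps
    return int(np.sum(S > tol))

With deficient matrices like those generated by the script earlier in the thread, the MATLAB-style threshold detected the rank deficiency in every trial (0.0 percent undetected), at the cost of a tol / S.min() ratio of roughly 110.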