>>>>> "PS" == Petr Savicky <savi...@cs.cas.cz>
>>>>>     on Fri, 8 May 2009 11:01:55 +0200 writes:

PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:

PD> I think that the real issue is that we actually do want almost-equal
PD> numbers to be folded together.
>>
>> yes, this now (revision 48469) will happen by default, using signif(x, 15)
>> where '15' is the default for the new optional argument 'digitsLabels'
>> {better argument name? (but must not start with 'label')}

PS> Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
PS> and missing(labels). In this situation, levels are computed as sort(unique(x))
PS> from possibly transformed x. Then, labels are constructed by a conversion of the
PS> levels to strings.

PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.

PS> If keepUnique is FALSE (the default), then
PS> - values x are transformed by signif(x, digitsLabels)
PS> - labels are computed using as.character(levels)
PS> - digitsLabels defaults to 15, but may be set to any integer value

PS> If keepUnique is TRUE, then
PS> - values x are preserved
PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
PS> - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that
 it should be >= 17; but see below.)

Your summary seems correct to me.

PS> There are several situations in which this approach produces duplicated levels.
PS> Besides the one described in my previous email, there are also others:
PS>   factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this does not make much sense; I've already considered
producing a warning in such cases, something like

   if(keepUnique && digitsLabels < 17)
     warning(gettextf(
       "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
       digitsLabels))

PS>   factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why prevent the useR
from shooting himself in the foot?

PS> I would like to suggest a modification.
PS> It eliminates most of the cases where
PS> we get duplicated levels. It would eliminate all such cases if the function
PS> signif() worked as expected. Unfortunately, if signif() works as it does in the
PS> current versions of R, we still get duplicated levels.

PS> The suggested modification is as follows.

PS> If keepUnique is FALSE (the default), then
PS> - values x are transformed by signif(x, digitsLabels)
PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
PS> - digitsLabels defaults to 15, but may be set to any integer value

I tend to like this change, given, as you found yesterday, that
as.character() is not even preserving 15 digits.
OTOH, as.character() has been in use throughout the very long
history of S (and R), whereas using sprintf() is not backward
compatible with it and actually depends on the LIBC implementation
of the system sprintf.
For that reason as.character() would be preferable.
Hmm....

PS> If keepUnique is TRUE, then
PS> - values x are preserved
PS> - labels are computed using sprintf("%.*g", 17, levels)
PS> - digitsLabels is ignored

I had originally planned to do exactly the above.
However, e.g., digitsLabels = 18 may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.

PS> Arguments for the modification are the following.

PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
PS>    to duplicated labels, as demonstrated in my previous email. So, I suggest
PS>    using sprintf("%.*g", digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfortunately quite
 a bit less backward compatible!}

PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
PS>    duplicated labels. So, I suggest forcing digitsLabels=17 if keepUnique=TRUE.

PS> If signif(,digitsLabels) worked as expected, then the above approach should not
PS> produce duplicated labels. Unfortunately, this is not the case.
PS> There are numbers which remain different in signif(x, 16), but are mapped
PS> to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
PS> found using the script

PS>   for (i in 1:50) {
PS>       x <- 10^runif(1, 38, 50)
PS>       y <- x * (1 + 0:500 * 1e-16)
PS>       y <- unique(signif(y, 16))
PS>       z <- unique(sprintf("%.16g", y))
PS>       stopifnot(length(y) == length(z))
PS>   }

PS> This script was tested on Intel default arithmetic and on Intel with SSE.

PS> Perhaps digitsLabels = 16 could be forbidden if keepUnique is FALSE.

PS> Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
PS> much larger numbers.

PS>   for (i in 1:200) {
PS>       x <- 10^runif(1, 250, 300)
PS>       y <- x * (1 + 0:500 * 1e-16)
PS>       y <- unique(signif(y, 15))
PS>       z <- unique(sprintf("%.15g", y))
PS>       stopifnot(length(y) == length(z))
PS>   }

PS> This script finds collisions, if SSE is enabled, on two
PS> Intel computers where I did the test. Without SSE, it
PS> finds collisions only on one of them. Maybe it also depends
PS> on the compiler, which is different.

probably rather on the exact implementation of the underlying C
library ("LIBC").

Thank you, Petr, for your investigations.

We all see that the simple requirement of
  *no more duplicate factor levels!*
leads to considerable programming effort for the case of
factor(<numeric>, .).
One prominent R-devel reader actually proposed to me in private
that factor(<numeric>, .) should give a *warning* by default,
since he considered it unsafe practice.

Note that your last investigations show that your (two) proposed changes
actually do *not* solve the problem entirely;
further note that (at least inside the sources) we now say that
duplicate levels will, in the future, signal not just a warning
but an error.

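To make the duplicate-label problem discussed above concrete, here is a
minimal sketch using only base R functions. It deliberately does not call
the development-version arguments 'keepUnique' or 'digitsLabels'; the
variable names are made up for illustration, and it assumes a C library
whose sprintf() rounds correctly:

```r
# Two doubles that differ only in their last bits.
x <- c(0.3, 0.1 + 0.2)
lev <- sort(unique(x))             # two distinct numeric levels

lab15 <- sprintf("%.15g", lev)     # labels with 15 significant digits
lab17 <- sprintf("%.17g", lev)     # labels with 17 significant digits

# anyDuplicated() returns 0 when all entries are distinct, otherwise
# the index of the first duplicated entry.
dup15 <- anyDuplicated(lab15) > 0  # TRUE: both levels print as "0.3"
dup17 <- anyDuplicated(lab17) > 0  # FALSE: 17 digits keep them apart
```

A check of this form is one possible way to detect the situation before
the labels are attached as levels.
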
As long as we don't want to allow factor(<numeric>) to fail (rarely),
I think (and that has actually been a recurring, daunting thought
for quite a few days) that we probably need an extra step of
checking for duplicate levels and, if we find some, recoding
"everything". This will blow up the body of the factor()
function even more.

What alternatives do you (all R-devel readers!) see?

Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel