Re: [Rd] suggestion for extending ?as.factor

Michael Dewey Sat, 09 May 2009 06:56:42 -0700

At 14:18 08/05/2009, Martin Maechler wrote:

>>>>> "PS" == Petr Savicky <savi...@cs.cas.cz>
>>>>>     on Fri, 8 May 2009 11:01:55 +0200 writes:

Somewhere below Martin asks for alternatives from list readers. I do not have alternatives, but I do have two comments, one immediately below this, the other embedded in-line.

This whole thread reminds me just why I have spent the best part of a decade climbing the virtual Matterhorn called 'Learning R' and why it is such a pleasure to use. It is the fact that somebody, somewhere cares enough about consistency, usability and accuracy to devote hours to getting even obscure details just right.

    PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
    PD> I think that the real issue is that we actually do want almost-equal
    PD> numbers to be folded together.
    >>

    >> yes, this now (revision 48469) will happen by default, using signif(x, 15)
    >> where '15' is the default for the new optional argument 'digitsLabels'
    >> {better argument name? (but must not start with 'label')}

    PS> Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
    PS> and missing(labels). In this situation, levels are computed as sort(unique(x))
    PS> from possibly transformed x. Then, labels are constructed by a conversion of the
    PS> levels to strings.

    PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.


    PS> If keepUnique is FALSE (the default), then
    PS> - values x are transformed by signif(x, digitsLabels)
    PS> - labels are computed using as.character(levels)
    PS> - digitsLabels defaults to 15, but may be set to any integer value

    PS> If keepUnique is TRUE, then
    PS> - values x are preserved
    PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
    PS> - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that
 it should be  >= 17;  but see below.)

Your summary seems correct to me.

    PS> There are several situations, when this approach produces duplicated levels.
    PS> Besides the one described in my previous email, there are also others
    PS> factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this is not much sensical; I've already contemplated
to produce a warning in such cases, something like

   if(keepUnique && digitsLabels < 17)
     warning(gettextf(
     "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
     digitsLabels))
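
The collision in the factor(c(0.3, 0.1+0.2), ...) example can be reproduced with released base R functions alone. The sketch below does not use 'keepUnique' or 'digitsLabels' (these arguments exist only in the development revision discussed in this thread); it only formats the two values with sprintf() at 15 and at 17 significant digits, so it illustrates the labelling step rather than factor() itself.

```r
# Two doubles that differ only in the last bits
x <- c(0.3, 0.1 + 0.2)
differ <- x[1] != x[2]

# Labels at 15 significant digits: both values print the same way
lab15 <- sprintf("%.*g", 15L, x)

# Labels at 17 significant digits: the two values stay apart
lab17 <- sprintf("%.*g", 17L, x)

differ                 # TRUE
lab15                  # "0.3" "0.3"
anyDuplicated(lab15)   # non-zero: a duplicated label
anyDuplicated(lab17)   # 0: no duplicated label
```

This is the situation the proposed warning is meant to catch: the values are kept distinct, but fewer than 17 digits are used to label them.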


    PS> factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why disallow the useR
to shoot into his foot?

I agree. As a useR I do not want to be stopped from doing anything. I would appreciate a warning just before I shoot myself in the foot and I definitely want one if it looks like I am going to aim for my head.
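
For the 1 + 0:5 * 1e-16 example, a small base R sketch shows how many distinct doubles are involved and what folding with signif(x, 15) does to them. It assumes IEEE double arithmetic and again avoids the development-only argument 'digitsLabels'.

```r
x <- 1 + 0:5 * 1e-16

# Number of distinct doubles among the six values
n_distinct <- length(unique(x))

# Folding almost-equal values with 15 significant digits
folded <- unique(signif(x, 15))

# Labels at 17 significant digits for the distinct values
lab17 <- sprintf("%.17g", unique(x))

n_distinct             # 3
length(folded)         # 1
anyDuplicated(lab17)   # 0: the distinct values get distinct labels
```

So with 15 digits everything collapses to a single level, while keeping the values distinct requires labels with 17 digits.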

    PS> I would like to suggest a modification. It eliminates most of the cases, where
    PS> we get duplicated levels. It would eliminate all such cases, if the function
    PS> signif() works as expected. Unfortunately, if signif() works as it does in the
    PS> current versions of R, we still get duplicated levels.

    PS> The suggested modification is as follows.

    PS> If keepUnique is FALSE (the default), then
    PS> - values x are transformed by signif(x, digitsLabels)
    PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
    PS> - digitsLabels defaults to 15, but may be set to any integer value

I tend to like this change, given -- as you found yesterday -- that
as.character() is not even preserving 15 digits.
OTOH,  as.character() has been in use for a very long history of
S (and R), whereas using sprintf() is not back compatible with
it and actually depends on the LIBC implementation of the system-sprintf.
For that reason as.character() would be preferable.
Hmm....

    PS> If keepUnique is TRUE, then
    PS> - values x are preserved
    PS> - labels are computed using sprintf("%.*g", 17, levels)
    PS> - digitsLabels is ignored

I had originally planned to do exactly the above.
However, e.g.,  digitsLabels = 18  may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.


    PS> Arguments for the modification are the following.

    PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
    PS> to duplicated labels as demonstrated in my previous email. So, i suggest to
    PS> use sprintf("%.*g", digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfortunately quite
 a bit less back-compatible!}

    PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
    PS> duplicated labels. So, i suggest to force digitsLabels=17, if keepUnique=TRUE.

    PS> If signif(,digitsLabels) works as expected, then the above approach should not
    PS> produce duplicated labels. Unfortunately, this is not the case.

    PS> There are numbers, which remain different in signif(x, 16), but are mapped
    PS> to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
    PS> found using the script

    PS> for (i in 1:50) {
    PS> x <- 10^runif(1, 38, 50)
    PS> y <- x * (1 + 0:500 * 1e-16)
    PS> y <- unique(signif(y, 16))
    PS> z <- unique(sprintf("%.16g", y))
    PS> stopifnot(length(y) == length(z))
    PS> }

    PS> This script is tested on Intel default arithmetic and on Intel with SSE.

    PS> Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE.

    PS> Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
    PS> much larger numbers.

    PS> for (i in 1:200) {
    PS> x <- 10^runif(1, 250, 300)
    PS> y <- x * (1 + 0:500 * 1e-16)
    PS> y <- unique(signif(y, 15))
    PS> z <- unique(sprintf("%.15g", y))
    PS> stopifnot(length(y) == length(z))
    PS> }

    PS> This script finds collisions, if SSE is enabled, on two
    PS> Intel computers, where i did the test. Without SSE, it
    PS> finds collisions only on one of them. May be, it depends
    PS> also on the compiler, which is different.

probably rather on the exact implementation of the underlying C
library ("LIBC").

Thank you, Petr, for your investigations.
We all see that the simple requirement of
   *no more duplicate factor levels !*
leads to considerable programming efforts for the case of
factor(<numeric>, .).

One prominent R-devel reader actually proposed to me in private,
that  factor(<numeric>, .)  should give a *warning* by default,
since he considered it unsafe practice.

Note that your last investigations show that your (two) proposed
changes actually do *not* solve the problem entirely;
further note that (at least inside the sources), we now say that
duplicate levels will not just signal a warning, but an error in
the future.
As long as we don't want to allow  factor(<numeric>)  to fail -- rarely --
I think (and that actually has been a recurring daunting thought
for quite a few days) that we probably need an
extra step of checking for duplicate levels, and if we find
some, recode "everything". This will blow up the body of the
factor() function even more.
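
One possible reading of "check for duplicate levels and recode" is sketched below. It is only an illustration under stated assumptions and is not taken from the R sources: the helper name factor_merge_dups and its 'digits' argument are hypothetical, and labels are built with sprintf() as in Petr's proposal. Values whose labels coincide are simply merged into one level.

```r
# Hypothetical helper, not part of base R: build a factor from numeric x
# and merge any levels whose labels turn out to be identical.
factor_merge_dups <- function(x, digits = 15L) {
  xs  <- signif(x, digits)               # fold almost-equal values
  lev <- sort(unique(xs))                # candidate levels
  lab <- sprintf("%.*g", digits, lev)    # candidate labels, possibly duplicated
  # Recode through the labels, so equal labels always share one level
  factor(lab[match(xs, lev)], levels = unique(lab))
}

f <- factor_merge_dups(c(0.3, 0.1 + 0.2, 0.5))
levels(f)       # "0.3" "0.5"
as.integer(f)   # 1 1 2
```

Because the final levels are unique(lab), the result cannot contain duplicated levels even when signif() leaves two values distinct that sprintf() then maps to the same string. The price is that such values are silently merged, which may or may not be what is wanted.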

What alternatives do you (all R-devel readers!) see?

Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Michael Dewey
http://www.aghmed.fsnet.co.uk

