On 27/04/2014, 10:16 AM, Hadley Wickham wrote:
Is there a reason it's a factor and not a string? A string would seem to be
more appropriate to me (given that we know it's a number that can't be
represented exactly by R).
The user asked that anything which can't be converted to a number should
be converted to a factor.
Yes, that's a bad default, but some people rely on it.
Duncan Murdoch
Hadley
On Saturday, April 26, 2014, Martin Maechler <maech...@stat.math.ethz.ch>
wrote:
Simon Urbanek <simon.urba...@r-project.org>
on Sat, 19 Apr 2014 13:06:15 -0400 writes:
> On Apr 19, 2014, at 9:00 AM, Martin Maechler <maech...@stat.math.ethz.ch> wrote:
>>>>>>> McGehee, Robert <robert.mcge...@geodecapital.com>
>>>>>>> on Thu, 17 Apr 2014 19:15:47 -0400 writes:
>>
>>>> This is all application specific and
>>>> sort of beyond the scope of type.convert(), which now behaves as it
>>>> has been documented to behave.
>>
>>> That's only a true statement because the documentation was changed
>>> to reflect the new behavior! The new feature in type.convert certainly does
>>> not behave according to the documentation as of R 3.0.3. Here's a snippet:
>>
>>> The first type that can accept all the
>>> non-missing values is chosen (numeric and complex return values
>>> will represented approximately, of course).
>>
>>> The key phrase is in parentheses, which reminds the user to expect
>>> a possible loss of precision. That important parenthetical was removed from
>>> the documentation in R 3.1.0 (among other changes).
>>
>>> Putting aside the fact that this introduces a large amount of
>>> unnecessary work rewriting SQL / data import code, SQL packages, my biggest
>>> conceptual problem is that I can no longer rely on a particular function
>>> call returning a particular class. In my example querying stock prices,
>>> about 5% of prices came back as factors and the remaining 95% as numeric,
>>> so we had random errors popping in throughout the morning.
>>
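One way to keep the returned class stable, whatever the type.convert() default is, is to declare the column class at import time. This is a minimal sketch, assuming the data arrive as delimited text. The column name `price` and the sample values are made up for illustration and are not taken from the stock-price query described above.

```r
# Hypothetical input: three prices, one with more digits than a double can hold.
csv_text <- "price\n0.5\n0.95812345678901234\n12.25"

# colClasses bypasses type guessing, so the column is always read as double
# (values are rounded to the nearest representable double where necessary).
prices <- read.csv(text = csv_text, colClasses = "numeric")

class(prices$price)
```

With the class fixed up front, downstream code never sees a factor for this column, at the cost of silently accepting the rounding.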
>>> Here's a short example showing how the new behavior can be
>>> unreliable. I pass a character representation of a uniformly distributed
>>> random variable to type.convert. 90% of the time it is converted to
>>> "numeric" and 10% of the time to a "factor" (in R 3.1.0). In the 10% of
>>> cases in which type.convert converts to a factor, the leading non-zero
>>> digit is always a 9. So if you were expecting a numeric value, then 1 in
>>> 10 times you may have a bug in your code that didn't exist before.
>>
>>>> options(digits=16)
>>>> cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
>>>> table(cl)
>>> cl
>>>  factor numeric
>>>     990    9010
>>
>> Yes.
>>
>> Murray's point is valid, too.
>>
>> But in my view, with the reasoning we have seen here,
>> *and* with the well known software design principle of
>> "least surprise" in mind,
>> I also do think that the default for type.convert() should be what
>> it has been for > 10 years now.
>>
> I think there should be two separate discussions:
> a) have an option (argument to type.convert and possibly read.table)
> to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
> str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
> str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333
where the 'exact' argument name has been ``imported'' from the
underlying C code.
[ As we CRAN package writers know by now, arguments can hardly be
abbreviated anymore, so I am not open to longer alternative argument
names. As someone who likes touch typing, I'm also not fond of camel
case or other keyboard gymnastics (;-), but if someone has a great
idea for a better argument name .... ]
Instead of only TRUE/FALSE, we could consider NA with
semantics "FALSE + warning" or also "TRUE + warning".
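
For readers on later R releases: the `exact` argument shown above was an uncommitted draft. Released versions of type.convert() (R 3.1.1 and later) offer a comparable three-way choice through the `numerals` argument. The sketch below assumes such a release. It illustrates the three documented settings and does not reproduce the draft discussed here; which strings count as lossy is decided by R itself.

```r
x <- "0.33333333333333331"  # 17 significant digits, as in the str() calls above

# "allow.loss": convert to double and accept the rounding silently.
lossy <- type.convert(x, as.is = TRUE, numerals = "allow.loss")

# "warn.loss": same conversion, but a warning is signalled if accuracy is lost
# (suppressed here only to keep the example quiet).
warned <- suppressWarnings(
  type.convert(x, as.is = TRUE, numerals = "warn.loss")
)

# "no.loss": do not convert if accuracy would be lost; the value stays
# character because as.is = TRUE (it would become a factor with as.is = FALSE).
strict <- type.convert(x, as.is = TRUE, numerals = "no.loss")

str(list(lossy = lossy, warned = warned, strict = strict))
```

Passing `as.is` explicitly keeps the character-versus-factor choice separate from the precision choice.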
> b) decide what the default for a) will be. I have no strong opinion,
> I can see arguments in both directions
I think many have seen the good arguments in both directions.
I'm still strongly advocating that we value long term stability
higher here, and revert to more compatibility with the many
years of previous versions.
If we'd use a default of 'exact=NA', I'd like it to mean
FALSE + warning, but would not much oppose TRUE + warning.
I agree that for the TRUE case, it may make more sense to return a
string-like object of a new (simple) class such as "bignum"
that was mentioned in this thread.
OTOH, this functionality should make it into R 3.1.1 in the
not so distant future, and thinking through consequences and
implementing the new class approach may just take a tad too much
time...
Martin
> But most importantly I think a) is better than the status quo - even
> if the discussion about b) drags out.
> Cheers,
> Simon
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel