Re: [Rd] stringsAsFactors

Brian Diggs Mon, 11 Feb 2013 12:16:51 -0800

On 2/11/2013 5:50 AM, Terry Therneau wrote:

I think your idea to remove the warnings is excellent, and a good
compromise.  Characters already work fine in modeling functions except
for the silly warning.
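
A minimal sketch of that claim, assuming a reasonably recent R; the data
frame and its column names are made up for illustration. A character
predictor and the equivalent factor should give the same fit:

```r
# Hypothetical toy data, not from the thread.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  g = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# g is a plain character vector here; lm() treats it as categorical.
fit_chr <- lm(y ~ g, data = d)

# The same model with an explicit factor.
d$g_fac <- factor(d$g)
fit_fac <- lm(y ~ g_fac, data = d)

# Coefficient names differ (gb vs. g_facb), so compare the values only.
same_fit <- isTRUE(all.equal(unname(coef(fit_chr)), unname(coef(fit_fac))))
same_fit
```

Whether a warning is emitted along the way depends on the R version in
use.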


It is interesting how often the defaults for a program reflect the data
sets in use at the time the defaults were chosen.  There are some such
in my own survival package whose proper value is no longer as "obvious"
as it was when I chose them.  Factors are very handy for variables which
have only a few levels and will be used in modeling.  Every character
variable of every dataset in "Statistical Models in S", which introduced
factors, is of this type so auto-transformation made a lot of sense.
The "solder" data set there is one for which Helmert contrasts are
proper so guess what the default contrast option was?  (I think there
are only a few data sets in the world for which Helmert makes sense,
however, and R eventually changed the default.)

For character variables that should not be factors, such as a street
address, stringsAsFactors can be a real PITA, and I expect that people's
preference for the option depends almost entirely on how often these
arise in their own work.  As long as there is an option that can be
overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
because the current value is a tripwire in the hallway that eventually
catches every new user.

I also agree that stringsAsFactors should not be TRUE, at least by
default. I do not change the default in my .Rprofile because I have seen
examples where people have gotten tripped up after changing this and
forgetting about it, or when sharing code and getting different,
unexpected results. However, my code is littered with this additional
argument so that I get, to me, the more sensible behavior.
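
To illustrate what passing the argument explicitly looks like, here is a
small sketch; the vectors and column names are made up for the example.
Because the argument is spelled out in each call, the result does not
depend on the session's default:

```r
# Hypothetical example data.
ids <- c("a17", "b42", "c03")
grp <- c("ctl", "trt", "ctl")

# Explicit argument: strings stay character vectors.
d_chr <- data.frame(id = ids, group = grp, stringsAsFactors = FALSE)

# Explicit argument: strings are converted to factors.
d_fac <- data.frame(id = ids, group = grp, stringsAsFactors = TRUE)

is.character(d_chr$id)  # TRUE
is.factor(d_fac$id)     # TRUE
```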

My preference follows from my conceptualization of what a factor is. To
me, a factor is the representation of a data type which has a fixed,
finite set of values which it can take, and which are known a priori. In
terms of sample and population, a variable could only be a factor if all
possible values that the variable could take in the population are known
(not just those in the given sample). Automatic conversion of strings to
factors assumes that the values that are present constitute the complete
and exclusive set of values which that variable could ever have, an
assumption which is often not correct in my experience. Names, street
addresses, and unique alphanumeric identifiers are all examples where
that assumption fails.
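
A short sketch of the distinction, using a made-up severity variable
(the values and level names are assumptions for illustration). Levels
inferred from the sample contain only what happened to be observed,
while levels declared up front describe the full set of possible values:

```r
# Hypothetical sample: "moderate" is possible but was not observed.
x <- c("mild", "severe", "mild")

# Levels inferred from the data alone.
auto <- factor(x)
levels(auto)   # "mild" "severe"

# Levels declared a priori.
full <- factor(x, levels = c("mild", "moderate", "severe"))
levels(full)   # "mild" "moderate" "severe"
table(full)    # "moderate" appears with a count of 0

# A value outside the inferred levels cannot be stored; it becomes NA
# (R also raises a warning, suppressed here to keep the output quiet).
auto2 <- auto
suppressWarnings(auto2[2] <- "moderate")
is.na(auto2[2])  # TRUE
```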

In contrast, a character variable is just a vector of arbitrary-length
character strings; it makes no further assumptions.

A secondary reason why I don't like a default conversion of strings to
factors is, on importing data, I often have to do some data cleaning
(unifying case, noting specific missing value encoding, collapsing
redundant entries) before I have a clean set of possible values to
convert to a factor. Once I have converted those variables which should
be factors to factors and left those that are just character strings as
character strings, I don't want later functions changing those choices
on me.
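
As a rough sketch of that kind of cleaning, assuming a made-up column of
raw values and the use of trimws() from a reasonably recent R: the
strings are tidied while still character, and only then converted to a
factor with an explicit set of levels.

```r
# Hypothetical raw import with mixed case, stray whitespace, an
# abbreviation, and a site-specific missing-value code.
raw <- c("Male", "male ", "FEMALE", "f", "unknown", "female")

clean <- tolower(trimws(raw))     # unify case, strip whitespace
clean[clean == "f"] <- "female"    # collapse a redundant entry
clean[clean == "unknown"] <- NA    # note the missing-value encoding

# Convert only once the set of possible values is clean.
sex <- factor(clean, levels = c("female", "male"))
table(sex, useNA = "ifany")
```

Had the raw strings been converted on import, the factor would have
carried six levels, one per distinct raw spelling.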

I realize that, historically, a factor was also a more efficient storage
mechanism for strings (store each unique string only once and then
record an index to that string), but with the global string table, my
understanding is that that is no longer the case.

Finally, stringsAsFactors being TRUE by default effectively says that
there is no place for character vectors; all character vectors should be
converted to factors as soon as possible. Taking this to the (absurd)
extreme, why even have a character vector type, then? The (appropriate)
existence of both a factor type and a character vector type is a further
argument that the latter should not be converted to the former
automatically.

Terry Therneau

On 02/11/2013 05:00 AM, r-devel-requ...@r-project.org wrote:

Both of these were discussed by R Core.  I think it's unlikely the
default for stringsAsFactors will be changed (some R Core members like
the current behaviour), but it's fairly likely the show.signif.stars
default will change.  (That's if someone gets around to it:  I
personally don't care about that one.  P-values are commonly used
statistics, and the stars are just a simple graphical display of them.
I find some p-values to be useful, and the display to be harmless.)

I think it's really unlikely the more extreme changes (i.e. dropping
show.signif.stars completely, or dropping p-values) will happen.

Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
I'll let the people who like it defend it.  What I will likely do is
make a few changes so that character vectors are automatically changed
to factors in modelling functions, so that operating with
stringsAsFactors=FALSE doesn't trigger silly warnings.



--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
