On Tue, 20 Jan 2009, Stavros Macrakis wrote:

I'm rather confused by the semantics of factors.

<snip actual confusion>

It is all very confusing.  Of course, most of this behavior is
documented and is easily determined by experimentation, but it would
be easier to learn and teach the language if there were some clear
principle underlying all this.  What am I missing?


No, it really is confusing. The problem is that there are two conflicting clear 
principles. Factors could be

 - integer variables with labels (similar to value labels in Stata/SPSS or C 
enums)
 - variables that take on values from a pre-specified set, implemented using 
integer codes (like Pascal enumerated types).

[In fact, there was historically even a third way to view factors, as a way to 
reduce the memory use of string variables. That's obsolete now.]

That is, the fact that they are small integers can be seen as part of the 
interface or just as part of the implementation.  It's obvious which one is 
right, but unfortunately it is differently obvious to different people.
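
To make the two readings concrete, here is a minimal sketch. The factor `f` and 
the variable names are made up for illustration; only base R is assumed.

```r
# A made-up factor with an explicit level order.
f <- factor(c("low", "high", "medium", "high"),
            levels = c("low", "medium", "high"))

# Reading 1: small integers with labels attached.
codes  <- as.integer(f)     # 1 3 2 3
labels <- levels(f)         # "low" "medium" "high"

# Reading 2: values drawn from a fixed set; the integer codes are
# just an implementation detail.
values  <- as.character(f)  # "low" "high" "medium" "high"
is_high <- f == "high"      # FALSE TRUE FALSE TRUE

# Assigning a value outside the set gives NA. R also issues a warning
# here, which is suppressed to keep the sketch quiet.
g <- f
suppressWarnings(g[1] <- "huge")
is.na(g[1])                 # TRUE
```

Under the first reading the codes are something you would use directly; under 
the second, the failed assignment is the behaviour you would expect.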

AFAIK there has never been a unified policy on this, dating back to before R, so 
different functions behave differently.  There have been changes in R over the 
years, mostly in the direction of making factors more like Pascal enumerations.
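
As a rough illustration of functions taking different sides, consider a factor 
whose labels happen to look like numbers. The data and names here are invented 
for the example; the behaviour shown is that of base R's as.numeric() and 
vector indexing.

```r
# Labels sort as strings, so the levels are "10", "20", "5".
f <- factor(c("10", "20", "5"))

# as.numeric() takes the "integers with labels" view and returns the codes.
as_codes <- as.numeric(f)                 # 1 2 3

# Going through the labels recovers the numbers that were typed in.
as_values <- as.numeric(as.character(f))  # 10 20 5

# Indexing with a factor also uses the codes, not the labels.
x <- c("5" = "five", "10" = "ten", "20" = "twenty")
by_code  <- unname(x[f])                  # "five" "ten" "twenty"
by_label <- unname(x[as.character(f)])    # "ten" "twenty" "five"
```

Comparison with `==`, by contrast, works on the labels, so both views are in 
play within a few lines of ordinary code.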

     -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle

