At the same time, there are a lot of uses in R where the vector really
is just an id, not something you would regress on.  I often work on
data frames with 10 million rows and 1 million ids; having that as a
factor is just downright wasteful and slow.
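
A rough sketch of the id case, scaled down so it runs quickly. The sizes,
the id format and the variable names are made up for illustration, and no
particular timing is claimed since that depends on the machine and R version:

```r
# Sketch only: 1e5 rows / 1e4 ids stand in for the much larger data above.
set.seed(1)
n_rows <- 1e5
n_ids  <- 1e4
ids <- sample(sprintf("id%06d", seq_len(n_ids)), n_rows, replace = TRUE)

# Converting to a factor has to find and sort the unique values first.
t_factor <- system.time(f <- factor(ids))

# For an id, all we usually need is matching, which works on characters as is.
target   <- ids[1]
rows_chr <- which(ids == target)
rows_fac <- which(f == target)

# Same rows either way; the factor conversion bought nothing for this use.
same_rows <- identical(rows_chr, rows_fac)
```

Printing t_factor shows what the conversion costs on a given machine.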

I think the global option is the best compromise here.
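
To make the idea concrete, here is a minimal sketch of how an option-driven
default can work in plain R. The helper make_table() and the option name
sketch.stringsAsFactors are hypothetical, invented for this example; this is
not data.table's API:

```r
# Hypothetical helper: the default for stringsAsFactors is read from a
# global option at call time, falling back to FALSE when the option is unset.
make_table <- function(...,
                       stringsAsFactors = getOption("sketch.stringsAsFactors", FALSE)) {
  data.frame(..., stringsAsFactors = stringsAsFactors)
}

# Option unset: strings stay character.
d1 <- make_table(X = letters[1:3], Y = 1:3)

# Option set by the user: strings become factors, with no change to the call.
old <- options(sketch.stringsAsFactors = TRUE)
d2  <- make_table(X = letters[1:3], Y = 1:3)
options(old)  # restore the previous (unset) state
```

Anyone who wants factors sets the option once; everyone else gets characters,
and an explicit argument still overrides both.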

On Tue, Apr 17, 2012 at 11:37 AM, Joseph Voelkel <[email protected]> wrote:
> Statistical work in R (what many of us do, I'd say) prefers factors over 
> characters:
>
> # factors
> DF <- data.frame(X=letters[rep(1:5,2)], Y=rnorm(10))
> (DF.lm<-lm(Y~X,DF))
> predict(DF.lm)
> # all works fine
>
> # characters
> DF <- data.frame(X=letters[rep(1:5,2)], Y=rnorm(10), stringsAsFactors=FALSE)
> (DF.lm<-lm(Y~X,DF)) # warning
> predict(DF.lm) # warning
>
> Not sure if this one will get resolved.
>
> Using factors instead of characters also ensures that a table of months or 
> days of the week can be listed in the natural (not alphabetic) ordering.
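
A small illustration of the ordering point, using made-up month labels
(the vector months_seen is invented for this example):

```r
months_seen <- c("Mar", "Jan", "Feb", "Jan", "Mar", "Mar")

# As characters, table() lists the labels alphabetically.
tab_chr <- table(months_seen)

# As a factor with explicit levels, table() follows the level order,
# here the calendar order given by the built-in constant month.abb.
months_fac <- factor(months_seen, levels = month.abb)
tab_fac    <- table(months_fac)
```

Note that tab_fac also carries zero counts for the months that do not occur.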
>
> Joe
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Matthew Dowle
> Sent: Sunday, April 15, 2012 1:20 PM
> To: Damian Betebenner
> Cc: [email protected]
> Subject: Re: [datatable-help] Coercian to character
>
> I thought I'd added something to FAQ 2.17 about that, but seems not.
> Will add, thanks. Maybe I only wrote it up in the comment when closing the 
> related feature request. It's deliberately different since my guess is that 
> most people most of the time (now) want characters left as characters and 
> keep setting stringsAsFactors to FALSE. Think the default for data.frame was 
> TRUE as a holdover from old versions of R before the global string cache was 
> added.
>
> It's not set in stone though so could be changed. In particular there could 
> be a global default like we've done for other arguments so you could change the 
> default if need be.
>
> It won't cause a compatibility issue (same as other differences in FAQ
> 2.17) or any issues down the road as far as I can think, but let me know if you 
> think of anything.
>
> Matthew
>
> On Sun, 2012-04-15 at 04:40 -0500, Damian Betebenner wrote:
>> I started having character vectors popping up in places I never had before 
>> but upon further investigation that turned out to be an issue with my own 
>> setup, not data.table.
>>
>> With regard to characters (and data.table's ability to handle them as a
>> key now), I did notice that data.table and data.frame default to using 
>> stringsAsFactors differently:
>>
>> DF <- data.frame(X=letters[1:10], Y=rnorm(10))
>> sapply(DF, class)
>>
>>         X         Y
>>  "factor" "numeric"
>>
>> DT <- data.table(X=letters[1:10], Y=rnorm(10))
>> sapply(DT, class)
>>
>> > DT <- data.table(X=rep(letters[1:10], each=2), Y=rnorm(20))
>> > sapply(DT, class)
>>           X           Y
>> "character"   "numeric"
>>
>>
>> Will this inconsistency cause problems down the road?
>>
>> Thanks for all your help,
>>
>> Damian
>>
>>
>> Damian Betebenner
>> Center for Assessment
>> PO Box 351
>> Dover, NH   03821-0351
>>
>> Phone (office): (603) 516-7900
>> Phone (cell): (857) 234-2474
>> Fax: (603) 516-7910
>>
>> [email protected]
>> www.nciea.org
>>
>>
>>
>>
>> -----Original Message-----
>> From: Matthew Dowle [mailto:[email protected]] On Behalf
>> Of Matthew Dowle
>> Sent: Thursday, April 12, 2012 5:50 PM
>> To: Damian Betebenner
>> Cc: [email protected]
>> Subject: Re: [datatable-help] Coercian to character
>>
>> It shouldn't coerce. What makes you think it does?
>>
>> > DT = data.table(a=factor(c("a","b","b","c")),b=1:4)
>> > DT[,sum(b),by=a]
>>      a V1
>> [1,] a  1
>> [2,] b  5
>> [3,] c  4
>> > str(DT[,sum(b),by=a])
>> Classes ‘data.table’ and 'data.frame':        3 obs. of  2 variables:
>>  $ a : Factor w/ 3 levels "a","b","c": 1 2 3
>>  $ V1: int  1 5 4
>>
>>
>>
>> On Thu, 2012-04-12 at 14:57 -0500, Damian Betebenner wrote:
>> > Data tablers
>> >
>> >
>> >
>> > Does data.table now coerce factors to character variables when doing
>> > by summaries?
>> >
>> >
>> >
>> > If so, is there any way to not allow this coercion?
>> >
>> >
>> >
>> > Thanks,
>> >
>> >
>> >
>> > Damian Betebenner
>> >
>> > Center for Assessment
>> >
>> > PO Box 351
>> >
>> > Dover, NH   03821-0351
>> >
>> >
>> >
>> > Phone (office): (603) 516-7900
>> >
>> > Phone (cell): (857) 234-2474
>> >
>> > Fax: (603) 516-7910
>> >
>> >
>> >
>> > [email protected]
>> >
>> > www.nciea.org
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > datatable-help mailing list
>> > [email protected]
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>
>
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help