> From: r-devel-boun...@r-project.org
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Matthew Dowle
> Sent: Friday, February 05, 2010 11:17 AM
> To: r-de...@stat.math.ethz.ch
> Subject: Re: [Rd] Why is there no c.factor?
>
> > concat() doesn't get a lot of use
> How do you know? Maybe its used a lot but the users had no need to tell
> you what they were using. The exact opposite might in fact be the case
> i.e. because concat is so good in splus, you just never hear of problems
> with it from the users. That might be a very good sign.

We don't use concat in many of our functions. It tends to be used
only where c fails. It is slower than c(), in part because it is
an SV4 generic while c is a .Internal (the fastest S+ interface to
C code). concat() is also written entirely in S code, with calls
to heavyweights like sapply. Writing it in C would speed it up a lot.
   > sys.time(for(i in 1:10000)c(1,2))
   [1] 0.27 0.27
   > sys.time(for(i in 1:10000)concat(1,2))
   [1] 20.29 20.29
   > sys.time(for(i in 1:10000)concat.two(1,2))
   [1] 0.52 0.52
The last just calls the default method of concat.two, which is a
call to c().

>
> > perhaps that model would work well for a concatenation function in R
> I'd be happy to test it. I'm a bit concerned about performance though
> given what you said about repeated recursive calls, and dispatch.
> Could you run the following test in s-plus please and post back the
> timing? If this small 100MB example was fine, then we could proceed to
> a 64bit 10GB test. This is quite nippy at the moment in R (1.1sec).
> I'd be happy with a better way as long as speed wasn't compromised.
>
> set.seed(1)
> L = as.vector(outer(LETTERS,LETTERS,paste,sep=""))   # union set of 676 levels
> F = lapply(1:100, function(i)
> {   # create 100 factors
>     f = sample(1:100, 1*1024^2 / 4, replace=TRUE)
>     # each factor 1MB large (262144 integers), plus small amount for the levels
>     levels(f) = sample(L,100)
>     # pick 100 levels from the union set
>     class(f) = "factor"
>     f
> })
>
> > head(F[[1]])
> [1] RT DM CO JV BG KU
> 100 Levels: YC FO PN IL CB CY HQ ...
> > head(F[[2]])
> [1] RK PD FE SG SJ CQ
> 100 Levels: JV FV DX NL XB ND CY QQ ...
>
> With c.factor from data.table, as posted, placed in .GlobalEnv
>
> > system.time(G <- do.call("c",F))
>    user  system elapsed
>    0.81    0.32    1.12
> > head(G)
> [1] RT DM CO JV BG KU      # looks right, comparing to F[[1]] above
> 676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> > G[262145:262150]
> [1] RK PD FE SG SJ CQ      # looks right, comparing to F[[2]] above
> 676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> > identical(as.character(G),as.character(unlist(F)))
> [1] TRUE
>
> So I guess this would be compared to following in splus ?
>
> system.time(G <- do.call("concat", F))
>
> or maybe its just the following :
>
> system.time(G <- concat(F))
>
> I don't have splus so I can't test that myself.
>
>
> "William Dunlap" <wdun...@tibco.com> wrote in message
> news:77eb52c6dd32ba4d87471dcd70c8d7000275b...@na-pa-vbe03.na.tibco.com...
> > -----Original Message-----
> > From: r-devel-boun...@r-project.org
> > [mailto:r-devel-boun...@r-project.org] On Behalf Of Peter Dalgaard
> > Sent: Friday, February 05, 2010 7:41 AM
> > To: Hadley Wickham
> > Cc: John Fox; r-devel@r-project.org; Thomas Lumley
> > Subject: Re: [Rd] Why is there no c.factor?
> >
> > Hadley Wickham wrote:
> > > On Thu, Feb 4, 2010 at 12:03 PM, Hadley Wickham <had...@rice.edu> wrote:
> > >>> I'd propose the following: If the sets of levels of all arguments are the
> > >>> same, then c.factor() would return a factor with the common set of levels;
> > >>> if the sets of levels differ, then, as Hadley suggests, the level-set of the
> > >>> result would be the union of sets of levels of the arguments, but a warning
> > >>> would be issued.
> > >> I like this compromise (as long as there was an argument to suppress
> > >> the warning)
> > >
> > > If I provided code to do this, along with the warnings for ordered
> > > factors and using the optimisation suggested by Matthew, is there any
> > > member of R core would be interested in sponsoring it?
> > >
> > > Hadley
> > >
> >
> > Messing with c() is a bit unattractive (I'm not too happy with the other
> > c methods either; normally c() strips attributes and reduces to the base
> > class, and those obviously do not), but a more general concat() function
> > has been suggested a number of times. With a suitable range of methods,
> > this could also be used to reimplement rbind.data.frame (which,
> > incidentally, already contains a method for concatenating factors, with
> > several ugly warts!)
>
> Yes, c() should have been put on the deprecated list a couple
> of decades ago, since people expect it to do too many
> incompatible things. And factor should have been a virtual
> class, with subclasses "FixedLevels" (e.g., Sex) or "AdHocLevels"
> (e.g., FamilyName), so c() and [()<- could do the appropriate
> thing in either case.
>
> Back to reality, S+ has a concat(...) function, whose comments say
>    # This function works like c() except that names of arguments are
>    # ignored. That is, it concatenates its arguments into a single
>    # S vector object, without considering the names of the arguments,
>    # in the order that the arguments are given.
>    #
>    # To make this function work for new classes, it is only necessary
>    # to make methods for the concat.two function, which concatenates
>    # two vectors; recursion will take care of the rest.
> concat() is not generic but it repeatedly calls concat.two(x,y), an
> SV4-generic that dispatches on the classes of x and y. Thus you
> can easily predict the class of concat(x,y,z), although it may not
> be the same as the class of concat(z,y,x), given suitably bizarre
> methods for concat.two().
>
> concat() doesn't get a lot of use but I think the idea is sound.
> Perhaps that model would work well for a concatenation function in R.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> >
> > --
> >    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
> >   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> >  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> > ~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
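
For readers who want to try the concat()/concat.two() model discussed in this thread, here is a minimal sketch in R. It is an illustration under stated assumptions, not the S+ implementation: the names concat and concat.two simply mirror the S+ ones, the fold uses Reduce() rather than whatever recursion S+ uses, and the factor method (union of the two level sets, no warning) is a hypothetical choice loosely following the proposal quoted above.

```r
library(methods)

# Hypothetical two-argument generic; S4 dispatch looks at the classes of
# both x and y, as described for the S+ concat.two().
setGeneric("concat.two", function(x, y) standardGeneric("concat.two"))

# Default method: just a call to c().
setMethod("concat.two", signature("ANY", "ANY"),
          function(x, y) c(x, y))

# Assumed factor method: the result's levels are the union of both level sets.
setMethod("concat.two", signature("factor", "factor"),
          function(x, y) {
            lev <- union(levels(x), levels(y))
            factor(c(as.character(x), as.character(y)), levels = lev)
          })

# concat() itself is an ordinary function: it drops argument names and
# folds concat.two() over its arguments from left to right.
concat <- function(...) Reduce(concat.two, unname(list(...)))

f1 <- factor(c("a", "b"))
f2 <- factor(c("b", "c"))
g  <- concat(f1, f2, f1)   # a factor whose levels cover both inputs
n  <- concat(1, 2, 3)      # falls through to the default method
```

Because the fold is pairwise, every call pays for one S4 dispatch per argument, which is the overhead the timing question above is concerned with. A new class only needs a concat.two() method to take part.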