Thanks, makes sense. Yes, as.data.frame.data.table currently removes the 'sorted' attribute, which is all a key is. I suppose that line could be removed so that the key would be left on the data.frame. You would then need to change the class back to data.table at the end of the function, though, and make sure you didn't change the order of the rows, otherwise that key would be invalid.
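As a rough, untested sketch of that idea (the function name is just a placeholder, it sticks to base R so the function itself needn't depend on data.table, and it pokes at the 'sorted' attribute directly, so treat it as an illustration rather than a recommendation):

    # Hypothetical sketch: remember the key, work with data.frame syntax,
    # then put the class and the 'sorted' attribute back at the end.
    # Only valid if the body never reorders, drops or inserts rows.
    WorkOnFrame <- function(data) {
        was.dt <- inherits(data, "data.table")
        k <- attr(data, "sorted")       # the key is just this attribute
        data <- as.data.frame(data)     # currently drops 'sorted'

        # ... data.frame-only work that preserves the row order ...

        if (was.dt) {
            class(data) <- c("data.table", "data.frame")
            attr(data, "sorted") <- k   # reattach the key (may be NULL)
        }
        data
    }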
However, the packages I use themselves use other packages that I don't use directly and know nothing about. I don't see the issue. Disk space? Memory space? The banner? There is also this related FR: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=984&group_id=240&atid=978

Just to check: you do know that the result of j in data.table can happily be a data.frame? So if your user is using data.table to call your function, he won't mind. If he's passing the entire data.table to your function, then he's not going to be wanting to retain the key anyway. You're returning some statistical result to him (not the original data back), so why does it make sense to retain the key?

The functional idiom you're showing is one of the things I don't like about data.frame in R. It's one of the reasons the syntax in data.table is different. I'll translate it with comments to what really happens:

    MyFunc <- function(data, numerator.var, denominator.var) {
        data <- data[order(            # reorder all columns of the data
            data[, numerator.var]), ]  # copy one column to a new vector
        data$metric <-                 # copy all of 'data' (doesn't just add a column)
            data[, numerator.var] /    # new copy of vector
            data[, denominator.var]    # new copy of vector
        data$cum.metric <-             # copy all of 'data' again, and
                                       # lock user into your choice of column name
            cumsum(data$metric)        # new copy of metric vector (not sure)
        return(data)                   # finally, gosh, I'm worn out after all that
    }

Contrast that to this:

    MyFunc <- function(numerator, denominator) {
        o = order(numerator)
        cumsum((numerator/denominator)[o])   # oh, that's what it does!
    }

A data.table user would call the latter like this:

    DT[, MyFunc(colA, colB)]

so there aren't any copies of the columns going on, because colA and colB are vectors right there, and it's much faster. Or the user can do:

    DT[, MyFunc(colA, colB), by=grp]

and that saves you adding a grouping variable to MyFunc. Or, if MyFunc is already locked into accepting a data.frame, the data.table user can (and does) use it like this:

    DT[, MyFunc(data=.SD, "colA", "colB"), by=grp]

and it doesn't matter that the j comes back as a data.frame; that's still a list, which is fine for j. Obviously it's less efficient of course, because the data is being copied and added to, but the inefficiency is up in MyFunc. The data.table user might decide to take MyFunc, chop out all the inefficiency and just keep the bits it really does. Noticing that, strictly, your MyFunc 'returned' two columns, it might be written like this:

    MyFunc <- function(numerator, denominator) {
        o = order(numerator)
        data.frame(numerator[o], cumsum((numerator/denominator)[o]))
    }

Then the user can decide if he wants to cbind it to his data.frame, or fast assign it into a data.table, or by group, or whatever. That seems to me to be up to your user. Perhaps the job of MyFunc is to return its output given the input (and that's all).

Writing quickly, probably with errors and typos. There are many ways to do things, and the above is just one way. Maybe a more complicated example from you is needed, please, for me to see. My main concern is efficiency on large datasets; passing the large dataset into a function for it to be copied and copied just isn't a good idiom, as far as I can see. That's why in data.table the idea is to pass functions the columns themselves within the scope of the data.table, i.e. call the function in j.
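For concreteness, here is a quick self-contained sketch of those calling patterns (the table, the column names and the data.frame-locked variant are all made up for illustration, and written quickly, so treat it as a sketch rather than a recipe):

    library(data.table)

    # A small keyed table just for the example
    DT <- data.table(grp  = rep(c("a", "b"), each = 5),
                     colA = c(5:1, 10:6),
                     colB = 1:10)
    setkey(DT, grp)                      # keyed, though by= doesn't require it

    # The column-wise MyFunc from above
    MyFunc <- function(numerator, denominator) {
        o = order(numerator)
        cumsum((numerator/denominator)[o])
    }

    DT[, MyFunc(colA, colB)]             # whole table; colA and colB are the vectors themselves
    DT[, MyFunc(colA, colB), by = grp]   # the same, once per group

    # A data.frame-locked variant, coercing first so that data.frame-style
    # subsetting by character column name behaves the same whether it is
    # handed a data.frame or .SD
    MyFuncDF <- function(data, numerator.var, denominator.var) {
        data <- as.data.frame(data)
        data <- data[order(data[, numerator.var]), ]
        data$metric <- data[, numerator.var] / data[, denominator.var]
        data$cum.metric <- cumsum(data$metric)
        data
    }

    DT[, MyFuncDF(data = .SD, "colA", "colB"), by = grp]

That last call is the inefficient-but-workable route: the copying happens inside MyFuncDF, but j is happy because the data.frame it returns is still a list.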
Matthew

> Mainly it is that I am writing some library functions that I and a few
> others may be using. I don't want those functions to have to depend on
> data.table because I don't want it to need to be installed for a purpose
> that has nothing to do with it. But I use data.tables as input. Here is a
> pseudo example:
>
> MyFunc <- function(data, numerator.var, denominator.var)
> {
>     data <- data[order(data[, numerator.var]), ]
>     data$metric <- data[, numerator.var] / data[, denominator.var]
>     data$cum.metric <- cumsum(data$metric)
>
>     return(data)
> }
>
> I make this example to show that I need to preserve the whole data variable
> the whole way through and return a modified version. If I do
>
> data <- as.data.frame(data)
>
> as the first line of that function, then I lose the keys in a potential
> data.table that is passed in. If I use
>
> data <- as.data.table(data)
>
> and change the subsetting to be data.table compliant, then I am forcing
> someone to have a whole package loaded for something that can be done in
> the base language fine. There must be an agnostic way to do this.
> Apparently subset doesn't do it either if keys get lost.
>
> -Chris
>
> On 20 July 2011 08:48, Matthew Dowle <[email protected]> wrote:
>
>> Hi Chris,
>>
>> If you're writing a package and don't want to worry if someone passes
>> your package a data.table, then don't worry; just use data.frame syntax
>> and your non-datatable-aware package will work fine.
>>
>> If you're writing your own code you're in control of, just embrace the
>> data.table ;)
>>
>> If you're writing a function in an environment which is data.table aware,
>> but you want your function to accept either data.frame or data.table,
>> then at the beginning of your function just do:
>>
>> myfunction = function(x) {
>>     x = as.data.table(x)
>>     # proceed with data.table syntax
>> }
>>
>> or
>>
>> myfunction = function(x) {
>>     x = as.data.frame(x)
>>     # proceed with data.frame syntax
>> }
>>
>> Some of the CRAN packages that depend on data.table are doing that, I
>> think.
>>
>> In R itself it is common practice to coerce arguments to a common type
>> and then proceed with the appropriate syntax for that type. Consider that
>> matrix syntax is different syntax to data.frame syntax. You often see
>> as.classiwant() at the beginning of functions, or switches depending on
>> the type of object.
>>
>> Remember that is.data.frame() is TRUE for both data.frame and data.table,
>> but is.data.table() is TRUE only for data.table. as.data.table() does
>> nothing if x is already a data.table, and is an efficient class change if
>> x is a data.frame. Is efficiency the issue?
>>
>> Does that help? If not, more info about the problem will be needed please.
>>
>> Matthew
>>
>> > I'm used to seeing the column names at the bottom of the column too,
>> > but that is only if the data.table is long enough. My example was too
>> > short for that, so I made the same sort of mistake you did :(
>> >
>> > Okay, that is a way, but is it a good way? Not sure...
>> >
>> > 2011/7/20 Timothée Carayol <[email protected]>
>> >
>> >> Sorry my mistake -- subset does return a data.table.
>> >> (I was using as an example a data.table with 100 rows, and stupidly
>> >> using the fact that it printed the whole thing rather than the 10
>> >> first rows only as my criterion for whether it worked or not,
>> >> overlooking that print.data.table does print up to 100 rows. I feel
>> >> a bit stupid.)
>> >>
>> >> Why doesn't it work for you if that is the case?
>> >>
>> >> DF <- data.frame(a=1:200, b=1:10)
>> >> DT <- as.data.table(DF)
>> >> subDT <- subset(DT, select=a)
>> >> class(subDT)
>> >> subDF <- subset(DF, select=a)
>> >> class(subDF)
>> >> identical(as.data.frame(DT), DF)
>> >>
>> >> On Wed, Jul 20, 2011 at 12:50 PM, Chris Neff <[email protected]> wrote:
>> >>
>> >>> Yeah I realized that myself.
>> >>>
>> >>> Another one: the function "with" doesn't seem to do what I want...
>> >>> but at least it is consistent!
>> >>>
>> >>> 2011/7/20 Timothée Carayol <[email protected]>
>> >>>
>> >>>> Sorry --
>> >>>>
>> >>>> subset() was a poor idea, as it will return a data.frame even if
>> >>>> the argument is a data.table.
>> >>>>
>> >>>> 2011/7/20 Timothée Carayol <[email protected]>
>> >>>>
>> >>>>> Hi --
>> >>>>>
>> >>>>> You can use the subset() command with the select= option; not sure
>> >>>>> it's the best solution, though.
>> >>>>>
>> >>>>> Timothee
>> >>>>>
>> >>>>> On Wed, Jul 20, 2011 at 12:26 PM, Chris Neff <[email protected]> wrote:
>> >>>>>
>> >>>>>> I have a function where I pass a data frame and some variable
>> >>>>>> names to calculate statistics on. However, I am at a loss as to
>> >>>>>> how to write it correctly so that both data.frame and data.table
>> >>>>>> work with it. If I have:
>> >>>>>>
>> >>>>>> DF = data.frame(x=1:10, y=2:11, z=3:12)
>> >>>>>> DT = data.table(DF)
>> >>>>>> var.names = c("x", "y")
>> >>>>>>
>> >>>>>> I can do the following things to subset:
>> >>>>>>
>> >>>>>> DT[, var.names, with=FALSE]
>> >>>>>> DF[, var.names]
>> >>>>>>
>> >>>>>> but of course DT[, var.names] won't give me back what I want, and
>> >>>>>> DF[, var.names, with=FALSE] returns an error because with doesn't
>> >>>>>> exist there. So how do I do this?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> -Chris

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
