Matthew,
Thanks for the tips and comprehensive explanation! I like the set() approach, it's not very R-like but more elegant than eval() and get() -- I believe most data.table users might feel like the .SD option is more intuitive though (and many R users would tend to shy away from using loops)? I wonder if there's a rational for using set() behind the scenes, once the .SD-without-by feature is implemented... By the way I also vote in favor of the ":= with by" feature, that seems like a worthwhile addition to data.table. Again I'm finding that syntax very intuitive. --Mel. On 2012-04-17 08:53, Matthew Dowle wrote: > Hi, > > How about this : > > http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-tableSkip to the end marked EDIT and reversing 0 and NA for your case: > > DT = data.table(a=c(1L,0L,3L),b=c(0L,5:6),c=c(7:8,0L)) > a b c > [1,] 1 0 7 > [2,] 0 5 8 > [3,] 3 6 0 > > for (i in names(DT)[2:3]) > DT[get(i)==0L,i:=NA_integer_,with=FALSE] > >> DT > > a b c > [1,] 1 NA 7 > [2,] 0 5 8 > [3,] 3 6 NAIf you have 000's of columns, then this sort of thing is just what the new > set() is for, to avoid the overhead of repeatedly calling [.data.table, > e.g. : > > for (i in 2:3) > set(DT,which(DT[[i]]==0L),i,NA_integer_) > > Without the 'which' you get an error "i is type 'logical'. Must be > integer, or numeric is coerced with warning.". I'll add to that message > something like "logical isn't accepted as i for speed since set() is > intended for inside loops; checking and coercing logical to integer row > positions takes time. Wrap logical i with which() if required". > > One reason for liking for() loops with data.table is that working on one > column at a time makes sense for large tables (less working memory > needed). I thought about constructing a list() RHS of := with a vector of > column names on the LHS of :=, but that approach doesn't scale as the > number of columns grows, due to needed space for the entire RHS. The > for() loop above is much better for that reason. [Multiple LHS of := is > more for when the RHS is a single value repeated for all the columns, such > as 0L or NA, or, NULL to delete multiple columns in one step.] > > Finally, the .SD approach should work when bug #1732 is fixed (".SD, .N > and .BY should be available when by="" and by=NULL"), but still not as > efficient as the for loop above. > > Matthew > >> Dear all, I have a large data.table and I am simply trying to replace all zeros with NAs in a subset of columns. So I tried first: >> >>> dt[, lapply(.SD, function(x) x := ifelse(x==0, NA, x)), .SDcols=3:30] >> Error in lapply(.SD, function(x) `:=`(x, ifelse(x == 0, NA, x))) : object '.SD' not found Clearly that's not the right approach... Then I tried: >> >>> for (i in names(dt)[3:30]) { >> eval(parse(text=paste("dt[`", i, "`==0, `", i, "` := NA]", sep=""))) } That worked but is rather ugly. would you recommend any better way to avoid the eval(parse()) to perform such simple tasks? Thanks in advance, --Mel. _______________________________________________ datatable-help mailing list [email protected] [1] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:[email protected] [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
_______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
