Re: [R] Confusing behaviour in data.table: unexpectedly changing variable
Very sorry to hear this bit you. If you need a copy of the names before changing them by reference:

oldnames <- copy(names(DT))

This will be documented; it's on the bug list to do so. copy() is needed in other circumstances too, see ?copy. More details here:
http://stackoverflow.com/questions/18662715/colnames-being-dropped-in-data-table-in-r
http://stackoverflow.com/questions/15913417/why-does-data-table-update-namesdt-by-reference-even-if-i-assign-to-another-v

Btw, the r-help posting guide says (last time I looked) that you should only post to r-help about packages if you have tried the maintainer first but didn't hear back; i.e., r-help isn't for support about packages. I don't follow r-help, so please continue to cc me if you reply.

Matthew

On 25/09/13 00:47, Jonathan Dushoff wrote:
I got bitten badly when a variable I created for the purpose of recording an old set of names changed when I didn't think I was going near it. I'm not sure if this is a desired behaviour, or documented, or warned about. I read the data.table intro and the FAQ, and also ?setnames. Ben Bolker created a minimal reproducible example:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
names(DT)
## [1] "x" "y" "v"
oldnames <- names(DT)
print(oldnames)
## [1] "x" "y" "v"
setnames(DT, LETTERS[1:3])
print(oldnames)
## [1] "A" "B" "C"
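A minimal sketch (not from the original thread) showing the copy() fix applied to Ben Bolker's example:

library(data.table)
DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
oldnames <- copy(names(DT))   # copy() snapshots the names vector
setnames(DT, LETTERS[1:3])
print(oldnames)               # "x" "y" "v" -- unchanged this time
print(names(DT))              # "A" "B" "C"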
Re: [R] Problem with R CMD check and the inconsolata font business
On 11/3/2011 3:30 PM, Brian Diggs wrote:
Well, I figured it out. Or at least got it working. I had to run "initexmf --mkmaps" because apparently there was something wrong with my font mappings. I don't know why; I don't know how. But it works now. I think installing the font into the Windows Font directory was not necessary. I'm including the solution in case anyone else has this problem.

Many thanks Brian Diggs! I just had the same problem and that fixed it.

Matthew
Re: [R] data.table vs plyr reg output
Hi Geoff,

Please see this part of the r-help posting guide:

"For questions about functions in standard packages distributed with R (see the FAQ 'Add-on packages in R'), ask questions on R-help. If the question relates to a contributed package, e.g., one downloaded from CRAN, try contacting the package maintainer first. You can also use find("functionname") and packageDescription("packagename") to find this information. ONLY send such questions to R-help or R-devel if you get no reply or need further assistance. This applies to both requests for help and to bug reports."

where I've capitalised ONLY since it is bold in the original HTML. I only saw your post thanks to Google Alerts.

maintainer("data.table") returns the email address of the datatable-help list, with the posting guide in mind. However, for questions like this, I'd suggest the data.table tag on Stack Overflow (which I subscribe to):
http://stackoverflow.com/questions/tagged/data.table

Btw, I recently presented at LondonR. Here's a link to the slides:
http://datatable.r-forge.r-project.org/LondonR_2012.pdf

Matthew
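A quick illustration (mine, not from the post) of the two lookup helpers the guide mentions, assuming data.table is installed and loaded:

library(data.table)
find("data.table")                           # where the data.table() function comes from
packageDescription("data.table")$Maintainer  # the address to try first, per the guide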
Re: [R] how to convert list of matrix (raster:extract o/p) to data table with additional colums (polygon Id, class)
AKJ,

Please see this recent answer:
http://r.789695.n4.nabble.com/data-table-vs-plyr-reg-output-tp4634518p4634865.html

Matthew
Re: [R] SLOW split() function
Using Josh's nice example, data.table's built-in 'by' (optimised grouping) yields a 6 times speedup (100 seconds down to 15 on my netbook).

system.time(all.2b <- lapply(si, function(.indx) {
    coef(lm(y ~ x, data=d[.indx,]))
}))
   user  system elapsed
144.501   0.300 145.525

system.time(all.2c <- lapply(si, function(.indx) {
    minimal.lm(y = d[.indx, y], x = d[.indx, list(int, x)])
}))
   user  system elapsed
100.819   0.084 101.552

system.time(all.2d <- d[,minimal.lm2(y=y, x=cbind(int, x)),by=key])
   user  system elapsed
 15.269   0.012  15.323    # 6 times faster

head(all.2c)
$`1`
         coef        se
x1  0.5152438 0.6277254
x2  0.5621320 0.5754560

$`2`
         coef       se
x1  0.2228235 0.312918
x2  0.3312261 0.261529

$`3`
         coef        se
x1 -0.1972439 0.4674000
x2 -0.1674313 0.4479957

$`4`
          coef        se
x1 -0.13915746 0.2729158
x2 -0.03409833 0.2212416

$`5`
           coef        se
x1  0.007969786 0.2389103
x2 -0.083776526 0.2046823

$`6`
          coef        se
x1 -0.58576454 0.5677619
x2 -0.07249539 0.5009013

head(all.2d)
     key       coef        V2
[1,]   1  0.5152438 0.6277254
[2,]   1  0.5621320 0.5754560
[3,]   2  0.2228235 0.3129180
[4,]   2  0.3312261 0.2615290
[5,]   3 -0.1972439 0.4674000
[6,]   3 -0.1674313 0.4479957

minimal.lm2   # slightly modified version of Josh's
function(y, x) {
    obj <- lm.fit(x = x, y = y)
    resvar <- sum(obj$residuals^2)/obj$df.residual
    p <- obj$rank
    R <- .Call("La_chol2inv", x = obj$qr$qr[1L:p, 1L:p, drop = FALSE],
               size = p, PACKAGE = "base")
    m <- min(dim(R))
    d <- c(R)[1L + 0L:(m - 1L) * (dim(R)[1L] + 1L)]
    se <- sqrt(d * resvar)
    list(coef = obj$coefficients, se)
}
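The excerpt assumes objects ('d', 'si', 'minimal.lm' and the 'key' column) defined earlier in the thread; a minimal stand-in for their likely shape (my guess, for readers who want to run the snippet):

library(data.table)
set.seed(1)
n <- 1e5; g <- 1e4
d <- data.table(key = rep(1:g, each = n/g), int = 1, x = rnorm(n), y = rnorm(n))
si <- split(seq_len(n), d$key)   # list of row indices per group, as in the thread's split() approach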
Re: [R] How to map current Europe?
Hi Uwe,

When you cc from Nabble it doesn't show as cc'd on r-help. It's a web form with an "Email this post to..." box. I asked Nabble support (over a year ago) if they could reflect that in the cc field of the post they send to r-help, with no luck. The previous thread is cited automatically in the footer: the "View this message in context" link. I'm replying to this one because I happened to use Nabble to reply in another thread, in the same way, earlier this morning.

If it isn't ok to post from Nabble, there's an option to prevent posting from Nabble, I believe. To double check, I've sent this reply using Nabble. Did you get the (unreflected) cc? I placed your email address in the "Email this post to..." box.

Matthew
Re: [R] fast or space-efficient lookup?
Ivo,

Also, perhaps FAQ 2.14 helps: "Can you explain further why data.table is inspired by A[B] syntax in base?"
http://datatable.r-forge.r-project.org/datatable-faq.pdf
And, 2.15 and 2.16.

Matthew

Steve Lianoglou mailinglist.honey...@gmail.com wrote in message news:CAHA9McPQ4P-a2imjm=szgjfxyx0faw0j79fwq2e87dqkf9j...@mail.gmail.com...

Hi Ivo,

On Mon, Oct 10, 2011 at 10:58 AM, ivo welch ivo.we...@gmail.com wrote:
hi steve---agreed...but is there any other computer language in which an expression in a [ . ] is anything except a tensor index selector?

Sure, it's a type specifier in scala generics: http://www.scala-lang.org/node/113
Something similar to scale-eez in haskell. Also, in MATLAB (ugh) it's not even a tensor selector (they use normal parens there). But I'm not sure what that has to do w/ the price of tea in china.

With data.table, [ still is tensor-selector like, though. You can just pass in another data.table to use as the keys to do your selection through the `i` argument (like selecting rows), which I guess will likely be your most common use case if you're moving to data.table (presumably you are trying to take advantage of its quickness over big-table-like objects). You can use the `j` param to further manipulate columns. If you pass in a data.table as `i`, it will add its columns to `j`.

I'll grant you that it is different than your standard rectangular object selection in R, but the motivation isn't so strange, as both i,j params in normal calls to 'xxx[i,j]' are for selecting (ok, not manipulating) rows and columns on other rectangular-like objects, too.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
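A minimal illustration (mine, not from the thread) of passing one data.table as the `i` argument of another, i.e. the A[B] join syntax the FAQ discusses:

library(data.table)
A <- data.table(id = c("a","b","c"), x = 1:3, key = "id")
B <- data.table(id = c("b","c"))
A[B]            # rows of A whose key matches B
A[B, sum(x)]    # j evaluated over the matched rows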
Re: [R] multicore by(), like mclapply?
Package plyr has .parallel. Searching datatable-help for "multicore", say on Nabble here,
http://r.789695.n4.nabble.com/datatable-help-f2315188.html
yields three relevant posts and examples. Please check the wiki do's and don'ts to make sure you didn't fall into one of those traps, though (we don't know the data or task, so just guessing):
http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table

HTH
Matthew

ivo welch ivo.we...@gmail.com wrote in message news:CAPr7RtUroPQtQvoh5uBuT60OYkwGR+ufGr_Z=g5g+vljeoj...@mail.gmail.com...

dear r experts---Is there a multicore equivalent of by(), just like mclapply() is the multicore equivalent of lapply()? if not, is there a fast way to convert a data.table into a list based on a column that lapply and mclapply can consume? advice appreciated...as always.

regards, /iaw
Ivo Welch (ivo.we...@gmail.com)
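One way to convert a data.table into a per-group list that mclapply can consume, as the question asks (my sketch; the 2011 thread would have used package 'multicore', package 'parallel' is assumed here):

library(data.table)
library(parallel)
DT <- data.table(g = rep(1:4, each = 25), y = rnorm(100))
parts <- split(DT, DT$g)                              # one sub-table per group
res <- mclapply(parts, function(d) d[, mean(y)], mc.cores = 2)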
Re: [R] Efficient way to do a merge in R
Joshua Wiley jwiley.ps...@gmail.com wrote in message news:canz9z_kopuwkzb-zxr96pvulhhf2znxntxso9xnyho-_jum...@mail.gmail.com...

On Tue, Oct 4, 2011 at 12:40 AM, Rainer Schuermann rainer.schuerm...@gmx.net wrote:
Any comments are very welcome. 3. If that fails, and nobody else has a better idea, I would consider using a database engine for the job.

Not a bad idea for working with large datasets either.

or, the data.table package
http://datatable.r-forge.r-project.org/

Matthew
Re: [R] cannot install.packages(data.table)
Assuming you can install other packages ok, data.table depends on R >= 2.12.0. Which version of R do you have?

_If_ that's the problem, does anyone know if anything prevents R's error message from stating which dependency isn't satisfied? I think I've seen users confused by this before, for other packages too.

Matthew

Emmanuel Mayssat emays...@gmail.com wrote in message news:cacb6zmctdrjkbftqrw+tv2owptrkgwytc_-hvvtguzwu9gq...@mail.gmail.com...

Hello,

I am new at R. I am trying to see if R can work for me. I need to do database-like lookups (select * from table where name=='toto') and work with matrices (transpose, add columns, remove rows, etc). It seems that the data.table package can help.
http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table

I installed R and ...

install.packages("data.table")
Warning in install.packages("data.table") :
  argument 'lib' is missing: using '/usr/local/lib/R/site-library'
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
  package 'data.table' is not available

install.packages() doesn't show the package. Where can I find it?

--
Emmanuel
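A quick way to check the suspected cause (my illustration; the 2.12.0 floor is the dependency stated above):

getRversion()                 # the running R version
getRversion() >= "2.12.0"     # must be TRUE for this data.table release to install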
Re: [R] formatting a 6 million row data set; creating a censoring variable
This is the fastest data.table way I can think of:

ans = mydt[,list(mytime=.N),by=list(id,mygroup)]
ans[,censor:=0L]
ans[J(unique(id)), censor:=1L, mult="last"]

     id mygroup mytime censor
[1,]  1       A      1      1
[2,]  2       B      3      0
[3,]  2       C      3      0
[4,]  2       D      6      1
[5,]  3       A      3      0
[6,]  3       B      3      1
[7,]  4       A      1      1

"I'll post the timings on the real data set shortly."

Please do.

Matthew

William Dunlap wdun...@tibco.com wrote in message news:e66794e69cfde04d9a70842786030b9304e...@pa-mbx04.na.tibco.com...

I'll assume that all of an individual's data rows are contiguous and that an individual always passes through the groups in order (or, at least, the individual never leaves a group and then reenters it), so we can find everything we need to know by comparing each row with the previous row.

You can use rle() to quickly make the time column:

rle(paste(d$mygroup, d$id))$lengths
[1] 1 3 3 6 3 3 1

For the censor column it is probably easiest to consider what rle() must do internally and use a modification of that. E.g.,

isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
output <- d[outputRows, ]
output$mytime <- diff(c(0, which(outputRows)))
output$censor <- as.integer(isLastInRun(output$id))

which gives you

output
   gender mygroup id mytime censor
1       F       A  1      1      1
4       F       B  2      3      0
7       F       C  2      3      0
13      F       D  2      6      1
16      M       A  3      3      0
19      M       B  3      3      1
20      M       A  4      1      1

You showed a rearrangement of the columns:

output[, c("id", "mygroup", "mytime", "censor")]
   id mygroup mytime censor
1   1       A      1      1
4   2       B      3      0
7   2       C      3      0
13  2       D      6      1
16  3       A      3      0
19  3       B      3      1
20  4       A      1      1

This ought to be quicker than plyr, but data.table may do similar run-oriented operations.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Juliet Hannah
Sent: Wednesday, August 31, 2011 10:51 AM
To: r-help@r-project.org
Subject: [R] formatting a 6 million row data set; creating a censoring variable

List,

Consider the following data.

   gender mygroup id
1       F       A  1
2       F       B  2
3       F       B  2
4       F       B  2
5       F       C  2
6       F       C  2
7       F       C  2
8       F       D  2
9       F       D  2
10      F       D  2
11      F       D  2
12      F       D  2
13      F       D  2
14      M       A  3
15      M       A  3
16      M       A  3
17      M       B  3
18      M       B  3
19      M       B  3
20      M       A  4

Here is the reshaping I am seeking (explanation below).

     id mygroup mytime censor
[1,]  1       A      1      1
[2,]  2       B      3      0
[3,]  2       C      3      0
[4,]  2       D      6      1
[5,]  3       A      3      0
[6,]  3       B      3      1
[7,]  4       A      1      1

I need to create 2 variables. The first one is a time variable. Observe that for id=2, the variable mygroup=B was observed 3 times. In the solution we see in row 2 that id=2 has a mytime value of 3.

Next, I need to create a censoring variable. Notice id=2 has values of B, C, D for mygroup. This means the change from B to C and from C to D is observed. There is no change from D. I need to indicate this with a 'censoring' variable. So B and C would have values 0, and D would have a value of 1. As another example, id=1 never changes, so I assign it censor=1. Overall, if a change is observed, 0 should be assigned, and if a change is not observed, 1 should be assigned.

One potential challenge is that the original data set has over 5 million rows. I have ideas, but I'm still getting used to the data.table and plyr syntax. I also seek a base R solution. I'll post the timings on the real data set shortly. Thanks for your help.

sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

# Here is a simplified data set
myData <- structure(list(
  gender = c("F","F","F","F","F","F","F","F","F","F","F","F","F","M","M","M","M","M","M","M"),
  mygroup = c("A","B","B","B","C","C","C","D","D","D","D","D","D","A","A","A","B","B","B","A"),
  id = c(1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,4)),
  .Names = c("gender","mygroup","id"),
  class = "data.frame", row.names = c(NA, -20L))
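For readers re-running the thread: the three data.table lines above need a key on 'id' before the J() join; a self-contained re-run on myData might look like this (my glue code, not from the thread):

library(data.table)
mydt <- data.table(myData)
ans <- mydt[, list(mytime = .N), by = list(id, mygroup)]
ans[, censor := 0L]
setkey(ans, id)                                  # J(unique(id)) joins on this key
ans[J(unique(id)), censor := 1L, mult = "last"]
ans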
Re: [R] ddply from plyr package - any alternatives?
Adam,

"because I did not have time to entirely test"

Do you (or does your company) have an automated test suite in place? R 2.10.0 is nearly two years old, and R 2.12.0 is nearly one.

Matthew

AdamMarczak adam.marc...@gmail.com wrote in message news:1314385041626-3771731.p...@n4.nabble.com...

No, it's not much faster. I'd say it's faster by about 10-15% in my case. I don't want either plyr or data.table because our software on the server does not support R versions over 2.10, and both of them have a dependency on R >= 2.12. Also I do not want to use old archives because I did not have time to entirely test them, as it was a quick demand for a workaround.

Best regards,
Adam.
Re: [R] Sequential Naming of ggplot .pngs using plyr
Hi Justin,

In data.table 1.6.1 there was this news item:

"j's environment is now consistently reused so that local variables may be set which persist from group to group; e.g., incrementing a group counter: DT[,list(z,groupInd<-groupInd+1),by=x]"

One of the reasons data.table is fast is that there is no function run per group. It's just that j expression. That's run in the same persistent environment for each group, so you can do things like increment a group counter within it. If your data were in 'long' format (data.table prefers long format, like a database) it might be something like (the ggplot line is untested):

ctr = 1
DT[,{
    png(file=paste('/tmp/plot_number_',ctr,'.png',sep=''),height=8.5,width=11,units='in',pointsize=9,res=300)
    print(ggplot(aes(x=site,y=val))+geom_boxplot()+opts(title=paste('plot number',ctr,sep=' ')))
    dev.off()
    ctr <- ctr+1
}, by=site]

Btw, there was a new feature in 1.6.3, where you can subassign into a data.table 500 times faster than <-. See the NEWS for 1.6.3 for an example:
http://datatable.r-forge.r-project.org/

Matthew

Justin Haynes jto...@gmail.com wrote in message news:CAFaj53kjqy=1bJy+iLjeeLYKgvx=rte2h_ha24pt20wqvch...@mail.gmail.com...

Thanks Ista,

In my real code that is exactly what I'm doing, but I want to prepend the names with a sequential number for easier reference once the pngs are made. My initial thought was to add the sequential number to the data before sending it to plyr and drawing it out there, but that seems like an excessive extra step when I have 1e6 - 1e7 rows.

Justin

On Wed, Aug 10, 2011 at 2:42 PM, Ista Zahn iz...@psych.rochester.edu wrote:

Hi Justin,

On Wed, Aug 10, 2011 at 5:04 PM, Justin Haynes jto...@gmail.com wrote:

If I have data:

dat <- data.frame(a=rnorm(20),b=rnorm(20),c=rnorm(20),d=rnorm(20),site=rep(letters[5:8],each=5))

And want to plot like this:

ctr <- 1
for(i in c('a','b','c','d')){
    png(file=paste('/tmp/plot_number_',ctr,'.png',sep=''),height=8.5,width=11,units='in',pointsize=9,res=300)
    print(ggplot(dat[,names(dat) %in% c('site',i)],aes(x=factor(site),y=dat[,i]))+geom_boxplot()+opts(title=paste('plot number',ctr,sep=' ')))
    dev.off()
    ctr <- ctr+1
}

Is there a way to do the same naming using plyr (or data.table or foreach, which I am not familiar with at all!)?

This is not the same naming, but the same general idea can be achieved with plyr using

d_ply(melt(dat,id.vars='site'),.(variable),function(df) {
    png(file=paste('plyr_plot', unique(df$variable), '.png'),height=8.5,width=11,units='in',pointsize=9,res=300)
    print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot())
    dev.off()
})

I'm not up to speed on .parallel, foreach etc., so I'll leave the rest to someone else.

Best,
Ista

m.dat <- melt(dat,id.vars='site')
ddply(m.dat,.(variable),function(df) print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot()+ ..?)

And better yet, is there a way to do it using .parallel=T? Faceting is not really an option (unless I can facet onto multiple pages of a pdf or something) because these need to go into reports as individually labelled and titled plots.

As a bit of a corollary, is it really worth the headache to resolve this if I am only using melt/plyr to split on the four letter variables? With a larger set of data (1e6 rows), the melt/plyr version takes a significant amount of time, but .parallel=T drops the time significantly. Is the right answer a foreach loop, and can I do that with the increasing counter? (I haven't gotten beyond Hadley's .parallel feature in my parallel R dealings.)

dat <- data.frame(a=rnorm(1e6),b=rnorm(1e6),c=rnorm(1e6),d=rnorm(1e6),site=rep(letters[5:8],each=2.5e5))
ctr <- 1
system.time(for(i in c('a','b','c','d')){
+     png(file=paste('/tmp/plot_number_',ctr,'.png',sep=''),height=8.5,width=11,units='in',pointsize=9,res=300)
+     print(ggplot(dat[,names(dat) %in% c('site',i)],aes(x=factor(site),y=dat[,i]))+geom_boxplot()+opts(title=paste('plot number',ctr,sep=' ')))
+     dev.off()
+     ctr <- ctr+1
+ })
   user  system elapsed
 54.630   0.120  54.843

system.time(
+     ddply(melt(dat,id.vars='site'),.(variable),function(df) {
+         png(file='/tmp/plyr_plot.png',height=8.5,width=11,units='in',pointsize=9,res=300)
+         print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot())
+         dev.off()
+     },.parallel=F)
+ )
   user  system elapsed
  58.40    0.13   58.63

system.time(
+     ddply(melt(dat,id.vars='site'),.(variable),function(df) {
+         png(file='/tmp/plyr_plot.png',height=8.5,width=11,units='in',pointsize=9,res=300)
+         print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot())
+         dev.off()
+     },.parallel=T)
+ )
   user  system elapsed
  70.33    3.46   27.61

How might I speed this up and include the sequential
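A tiny standalone illustration (mine) of the persistent j environment from the 1.6.1 news item quoted above; groupInd increments once per group:

library(data.table)
DT <- data.table(x = rep(c("a","b","c"), each = 2), z = 1:6)
groupInd <- 0
DT[, list(z, groupInd <- groupInd + 1), by = x]   # group counter alongside z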
Re: [R] EXTERNAL: Re: subset with aggregate key
To close this thread on-list: packageVersion() was added to R in 2.12.0. data.table's dependency on 2.12.0 is updated, thanks.

Matthew

Jesse Brown jesse.r.br...@lmco.com wrote in message news:4e1b21a8.8090...@atl.lmco.com...

Matthew Dowle wrote:
Hi, Try package 'data.table'. It has a concept of keys which allows you to do exactly that. http://datatable.r-forge.r-project.org/ Matthew

Hi Matthew,

Unfortunately, the load of that library fails (it builds successfully). I'm currently looking into why. Error output looks something similar to:

library(data.table)
Error in .makeMessage(..., domain = domain, appendLF = appendLF) :
  could not find function "packageVersion"
Error : .onAttach failed in 'attachNamespace'
Error: package/namespace load failed for 'data.table'

I googled around a bit and there is mention of a bug in packageVersion, but there was no solution that I found. Is this something that is easily overcome?

Thanks,
Jesse
Re: [R] manipulating by lists and ave() functions
Users of package 'unknownR' already know simplify2array was added in R 2.13.0. They also know what else was added. Do you?
http://unknownr.r-forge.r-project.org/

Joshua Wiley jwiley.ps...@gmail.com wrote in message news:canz9z_j+trwoim3scayuaruors+8hyc30pmt_thiex6qmto...@mail.gmail.com...

On Sat, Jul 9, 2011 at 7:32 AM, David Winsemius dwinsem...@comcast.net wrote:

On Jul 9, 2011, at 9:44 AM, Berry Boessenkool wrote:

Maybe I'm missing something, but in what package do I find that function?

simplify2array(b)
Fehler: konnte Funktion "simplify2array" nicht finden   # Function wasn't found

help.search("simplify2array")
No help files found with alias or concept or title matching 'simplify2array' using fuzzy matching.

Perhaps it's new, since ?simplify2array brings up a help page and it's in base. Try updating.

Yes, simplify2array() was added in R 2.13.0 to support the simplify = "array" argument to sapply().

Josh

[snip]
Re: [R] Simple order() data frame question.
With data.table, the following is routine:

DT[order(a)]     # ascending
DT[order(-a)]    # descending, if a is numeric
DT[a>5,sum(z),by=c][order(-V1)]   # sum of z grouped by c, just where a>5, then show me the largest first
DT[order(-a,b)]  # order by a descending then by b ascending, if a and b are both numeric

It avoids peppering your code with $, and becomes quite natural after a short while; especially compound queries such as the 3rd example.

Matthew
http://datatable.r-forge.r-project.org/

Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4dcbec8b.6040...@uni-hamburg.de...

I was wondering whether it would be possible to make a method for data.frame with sort(). I think it would be more intuitive than using the complex construction of df[order(df$a),]. Is there any reason not to make it?

Ivan

Le 5/12/2011 15:40, Marc Schwartz a écrit :

On May 12, 2011, at 8:09 AM, John Kane wrote:

Argh. I knew it was at least partly obvious. I never have been able to read the order() help page and understand what it is saying. Thanks very much. By the way, to me it is counter-intuitive that the command is df1[order(df1[,2],decreasing=TRUE),]. For some reason I keep expecting it to be order( , df1[,2], decreasing=TRUE). So clearly I don't understand what is going on, but at least I am a lot better off. I may be able to get this graph to work.

John,

Perhaps it may be helpful to understand that order() does not actually sort() the data. It returns a vector of indices into the data, where those indices are the sorted ordering of the elements in the vector, or in this case, the column. So you want the output of order() to be used within the brackets for the row *indices*, to reflect the ordering of the column (or columns, in the case of a multi-level sort) that you wish to use to sort the data frame rows.

set.seed(1)
x <- sample(10)
x
 [1]  3  4  5  7  2  8  9  6 10  1

# sort() actually returns the sorted data
sort(x)
 [1]  1  2  3  4  5  6  7  8  9 10

# order() returns the indices of 'x' in sorted order
order(x)
 [1] 10  5  1  2  3  8  4  6  7  9

# This does the same thing as sort()
x[order(x)]
 [1]  1  2  3  4  5  6  7  8  9 10

set.seed(1)
df1 <- data.frame(aa = letters[1:10], bb = rnorm(10))
df1
   aa         bb
1   a -0.6264538
2   b  0.1836433
3   c -0.8356286
4   d  1.5952808
5   e  0.3295078
6   f -0.8204684
7   g  0.4874291
8   h  0.7383247
9   i  0.5757814
10  j -0.3053884

# These are the indices of df1$bb in sorted order
order(df1$bb)
 [1]  3  6  1 10  2  5  7  9  8  4

# Get df1$bb in increasing order
df1$bb[order(df1$bb)]
 [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884  0.1836433  0.3295078
 [7]  0.4874291  0.5757814  0.7383247  1.5952808

# Same thing as above
sort(df1$bb)
 [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884  0.1836433  0.3295078
 [7]  0.4874291  0.5757814  0.7383247  1.5952808

You can't use the output of sort() to sort the data frame rows, so you need to use order() to get the ordered indices and then use that to extract the data frame rows in the sort order that you desire:

df1[order(df1$bb), ]
   aa         bb
3   c -0.8356286
6   f -0.8204684
1   a -0.6264538
10  j -0.3053884
2   b  0.1836433
5   e  0.3295078
7   g  0.4874291
9   i  0.5757814
8   h  0.7383247
4   d  1.5952808

df1[order(df1$bb, decreasing = TRUE), ]
   aa         bb
4   d  1.5952808
8   h  0.7383247
9   i  0.5757814
7   g  0.4874291
5   e  0.3295078
2   b  0.1836433
10  j -0.3053884
1   a -0.6264538
6   f -0.8204684
3   c -0.8356286

Does that help?

Regards,
Marc Schwartz

--
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calan...@uni-hamburg.de
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/1525_8_1.php
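A minimal sketch (my code, not from the thread) of the sort() method for data.frame that Ivan asks about; 'by' names the column to sort on:

sort.data.frame <- function(x, decreasing = FALSE, by = 1L, ...) {
    x[order(x[[by]], decreasing = decreasing), , drop = FALSE]
}
sort(df1, by = "bb")                     # ascending by column bb
sort(df1, by = "bb", decreasing = TRUE)  # descending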
[R] [R-pkgs] unknownR : you didn't know you didn't know?
Do you know how many functions there are in base R? How many of them do you know you don't know? Run unk() to discover your unknown unknowns. It's fast and it's fun!

unknownR v0.2 is now on CRAN. More information is on the homepage:
http://unknownr.r-forge.r-project.org/

Or, just install the package and try it:

install.packages("unknownR")
library(unknownR)
?unk
unk()
learn()

Matthew
[R] [R-pkgs] data.table 1.6 is now on CRAN
data.table offers fast subset, fast grouping and fast ordered joins in a short and flexible syntax, for faster development. It was first released in August 2008 and is now the 3rd most popular package on Crantastic with 20 votes and 7 reviews.

* X[Y] is a fast join for large data.
* X[,sum(b*c),by=a] is fast aggregation.
* 10+ times faster than tapply()
* 100+ times faster than ==

It inherits from data.frame. It is compatible with packages that only accept data.frame.

This is a major release that adds S4 compatibility to the package, contributed by Steve Lianoglou. Recently the FAQs have been revised and ?data.table has been simplified with shorter and easier examples. There is a wiki (with content), three vignettes, a video, a NEWS file and an active user community.

http://datatable.r-forge.r-project.org/
http://unknownr.r-forge.r-project.org/toppkgs.html

Matthew, Tom and Steve
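A toy illustration (mine, not part of the announcement) of the two bulleted syntaxes:

library(data.table)
X <- data.table(a = rep(1:2, each = 3), b = 1:6, c = 6:1, key = "a")
Y <- data.table(a = 2L)
X[Y]                     # fast ordered join on the key
X[, sum(b * c), by = a]  # fast grouped aggregation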
Re: [R] R licence
Peter,

If the proprietary part of REvolution's product is ok, then surely Stanislav's suggestion is too. No?

Matthew

peter dalgaard pda...@gmail.com wrote in message news:be157cf5-9b4b-45a0-a7d4-363b774f1...@gmail.com...

On Apr 7, 2011, at 09:45, Stanislav Bek wrote:

Hi, is it possible to use some statistic computing by R in proprietary software? Our software is written in c#, and we intend to use http://rdotnet.codeplex.com/ to get R working there. Especially we want to use the loess function.

You need to take legal advice to be certain, but offhand I would say that this kind of circumvention of the GPL is _not_ allowed. It all depends on whether the end product is a derivative work, in which case the whole must be distributed under a GPL-compatible licence. The situation around GPL-incompatible plug-ins, or plug-ins interfacing to R in GPL-incompatible software, is legally murky, but using R as a subroutine library for proprietary code is clearly crossing the line, as far as I can tell.

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
Re: [R] R licence
Duncan,

Letting you know, then, that I just don't see how the first paragraph here:
http://www.revolutionanalytics.com/downloads/gpl-sources.php
is compatible with clause 2(b) here:
http://www.gnu.org/licenses/gpl-2.0.html
Perhaps somebody could explain why it is?

Matthew

Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4d9da9ff.9020...@gmail.com...

On 07/04/2011 7:47 AM, Matthew Dowle wrote:
Peter, If the proprietary part of REvolution's product is ok, then surely Stanislav's suggestion is too. No?

Revolution has said that they believe they follow the GPL, and they haven't been challenged on that. If you think that they don't, you could let an R copyright holder know what they're doing that's a license violation.

My opinion of Stanislav's question is that he doesn't give enough information to answer. If he is planning to distribute R as part of his product, he needs to follow the GPL. If not, I don't think any R copyright holder has anything to complain about.

Duncan Murdoch
Re: [R] General binary search?
Try data.table:::sortedmatch, which is implemented in C. It requires its input to be sorted (and doesn't check).

Stavros Macrakis macra...@alum.mit.edu wrote in message news:BANLkTi=j2lf5syxytv1dd4k9wr0zgk8...@mail.gmail.com...

Is there a generic binary search routine in a standard library which a) works for character vectors b) runs in O(log(N)) time?

I'm aware of findInterval(x,vec), but it is restricted to numeric vectors. I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and fastmatch), but I need the greatest-lower-bound match in my application.

findInterval is also slow for large N=length(vec) because of the O(N) checking it does, as Duncan Murdoch has pointed out (https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html): though its documentation says it runs in O(n * log(N)), it actually runs in O(n * log(N) + N), which is quite noticeable for largish N. But that is easy enough to work around by writing a variant of findInterval which calls find_interv_vec without checking.

-s

PS Yes, binary search is a one-liner in R, but I always prefer to use standard, fast native libraries when possible.

binarysearch <- function(val, tab, L, H) {
    while (H >= L) {
        M <- L + (H - L) %/% 2
        if (tab[M] > val) H <- M - 1
        else if (tab[M] < val) L <- M + 1
        else return(M)
    }
    return(L - 1)
}
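A usage sketch of the one-liner (mine), after restoring the comparison and assignment operators the archive stripped:

tab <- c("apple", "banana", "cherry", "grape")
binarysearch("cherry",  tab, 1L, length(tab))  # 3, exact match
binarysearch("coconut", tab, 1L, length(tab))  # 3, index of the greatest lower bound ("cherry")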
Re: [R] How to calculate means for multiple variables in samples with different sizes
Hi,

One-liners in data.table are:

x.dt[,lapply(.SD,mean),by="sample"]
     sample replicate   height    weight  age
[1,]      A       2.0 12.2      0.503     6.00
[2,]      B       1.5 12.75000  0.715     4.50
[3,]      C       2.5 11.35250  0.5125000 3.75
[4,]      D       2.0 14.99333  0.673     5.33

without the replicate column:

x.dt[,lapply(list(height,weight,age),mean),by="sample"]
     sample       V1        V2   V3
[1,]      A 12.2      0.503     6.00
[2,]      B 12.75000  0.715     4.50
[3,]      C 11.35250  0.5125000 3.75
[4,]      D 14.99333  0.673     5.33

one (long) way to retain the column names:

x.dt[,lapply(list(height=height,weight=weight,age=age),mean),by="sample"]
     sample   height    weight  age
[1,]      A 12.2      0.503     6.00
[2,]      B 12.75000  0.715     4.50
[3,]      C 11.35250  0.5125000 3.75
[4,]      D 14.99333  0.673     5.33

or this is shorter:

ans = x.dt[,lapply(.SD,mean),by="sample"]
ans$replicate = NULL
ans
     sample   height    weight  age
[1,]      A 12.2      0.503     6.00
[2,]      B 12.75000  0.715     4.50
[3,]      C 11.35250  0.5125000 3.75
[4,]      D 14.99333  0.673     5.33

or another way:

mycols = c("height","weight","age")
x.dt[,lapply(.SD[,mycols,with=FALSE],mean),by="sample"]
     sample   height    weight  age
[1,]      A 12.2      0.503     6.00
[2,]      B 12.75000  0.715     4.50
[3,]      C 11.35250  0.5125000 3.75
[4,]      D 14.99333  0.673     5.33

or another way:

x.dt[,lapply(.SD[,list(height,weight,age)],mean),by="sample"]
     sample   height    weight  age
[1,]      A 12.2      0.503     6.00
[2,]      B 12.75000  0.715     4.50
[3,]      C 11.35250  0.5125000 3.75
[4,]      D 14.99333  0.673     5.33

The way Jim showed:

x.dt[, list(height = mean(height),
            weight = mean(weight),
            age = mean(age)), by = "sample"]

is the more flexible syntax for when you want different functions on different columns, easily, and as a bonus is fast.

Matthew

Dennis Murphy djmu...@gmail.com wrote in message news:AANLkTimxXL8BqTaYKUb=saee2cra9fosfuap4qzkx...@mail.gmail.com...

Hi:

Here are a few one-liners. Calling your data frame dd,

aggregate(cbind(height, weight, age) ~ sample, data = dd, FUN = mean)
  sample   height    weight  age
1      A 12.2      0.503     6.00
2      B 12.75000  0.715     4.50
3      C 11.35250  0.5125000 3.75
4      D 14.99333  0.673     5.33

With package doBy:

library(doBy)
summaryBy(height + weight + age ~ sample, data = dd, FUN = mean)
  sample height.mean weight.mean age.mean
1      A 12.2        0.503       6.00
2      B 12.75000    0.715       4.50
3      C 11.35250    0.5125000   3.75
4      D 14.99333    0.673       5.33

With package plyr:

library(plyr)
ddply(dd, .(sample), colwise(mean, .(height, weight, age)))
  sample   height    weight  age
1      A 12.2      0.503     6.00
2      B 12.75000  0.715     4.50
3      C 11.35250  0.5125000 3.75
4      D 14.99333  0.673     5.33

Dennis

On Fri, Mar 11, 2011 at 1:32 AM, Aline Santos aline...@gmail.com wrote:

Hello R-helpers:

I have data like this:

sample  replicate  height  weight  age
A       1.00       12.0    0.64    6.00
A       2.00       12.2    0.38    6.00
A       3.00       12.4    0.49    6.00
B       1.00       12.7    0.65    4.00
B       2.00       12.8    0.78    5.00
C       1.00       11.9    0.45    6.00
C       2.00       11.84   0.44    2.00
C       3.00       11.43   0.32    3.00
C       4.00       10.24   0.84    4.00
D       1.00       14.2    0.54    2.00
D       2.00       15.67   0.67    7.00
D       3.00       15.11   0.81    7.00

Now, how can I calculate the mean for each condition (height, weight, age) in each sample, considering the samples have different numbers of replicates? The final matrix should look like:

sample  height  weight  age
A       12.20   0.50    6.00
B       12.75   0.72    4.50
C       11.35   0.51    3.75
D       14.99   0.67    5.33

This is a simplified version of my dataset, which consists of 100 samples (unequally distributed in 530 replicates) for 600 different conditions. I appreciate all the help.

A.S.
Re: [R] Transforming relational data
With the new example, what is the full output, and what do you need instead? Was it correct for the previous example?

Matthew

mathijsdevaan mathijsdev...@gmail.com wrote in message news:1298372018181-3318939.p...@n4.nabble.com...

Hi Matthew, thanks for your help. There are some things going wrong still. Consider this (slightly extended) example:

library(data.table)
DT = data.table(read.table(textConnection("A B C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001
7 3 a 2004
8 3 b 2004
9 3 d 2004
10 4 c 2001
11 4 d 2001"),head=TRUE,stringsAsFactors=FALSE))

firststep = DT[,cbind(A,expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
firststep
      C A Var1 Var2     v
1  1999 1    b    a 0.250
2  1999 1    c    a 0.250
3  1999 1    d    a 0.250
4  1999 1    a    b 0.250
5  1999 1    c    b 0.250
6  1999 1    d    b 0.250
7  1999 1    a    c 0.250
8  1999 1    b    c 0.250
9  1999 1    d    c 0.250
10 1999 1    a    d 0.250
11 1999 1    b    d 0.250
12 1999 1    c    d 0.250
13 2001 2    b    a 0.250
14 2001 4    b    a 0.250
15 2001 2    a    b 0.250
16 2001 4    a    b 0.250
17 2001 2    b    a 0.250
18 2001 4    b    a 0.250
19 2001 2    a    b 0.250
20 2001 4    a    b 0.250
21 2004 3    b    a 0.333
22 2004 3    c    a 0.333
23 2004 3    a    b 0.333
24 2004 3    c    b 0.333
25 2004 3    a    c 0.333
26 2004 3    b    c 0.333

Following firststep, projects 2 and 4 involved individuals a and b, while actually c and d were involved. It seems that something is going wrong in transforming the data. Then going to the final result, a list is generated of years and sums of v, rather than a list of projects and sums of v. Probably I haven't been clear enough: I want to produce a list of all projects and the familiarity of all project members involved right before the start of the project. Example:

project_id  familiarity
4           0.25

Members c and d were jointly involved in 3 projects: 1, 2, 4. Project 4 took place in 2001, so only project 1 took place before that (1999; project 2 took place in the same year and is therefore not included). The average familiarity between the members in project 1 was 1/4, so:

project_id  familiarity
4           0.25

Thanks!

Matthew Dowle wrote:

Thanks for the attempt and required output. How about this?

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2,C)
firststep = firststep[,transform(.SD,cv=cumsum(v)),by=list(Var1,Var2)]
setkey(firststep,Var1,Var2,C)
DT[, {x=data.table(expand.grid(B,B),C[1]-1L)
      firststep[x,roll=TRUE,nomatch=0][,sum(cv)]   # prior familiarity
     },by="C"]
        C  V1
[1,] 1999 0.0
[2,] 2001 0.5
[3,] 2004 2.5

I think you may have said you have large data. If so, this method should be fast. Please let us know how you get on.

HTH
Matthew

On Thu, 17 Feb 2011 23:07:19 -0800, mathijsdevaan wrote:

OK, for the last step I have tried this (among other things):

library(data.table)
DT = data.table(read.table(textConnection("A B C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001
7 3 a 2004
8 3 b 2004
9 3 d 2004"),head=TRUE,stringsAsFactors=FALSE))

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2)
list1 <- firststep[J(expand.grid(DT$B,DT$B),v=1/length(DT$B)),nomatch=0][,sum(v)]
list1
# 27

What I would like to get:

list
1  0
2  0.5
3  2.5

Thanks!
Re: [R] Transforming relational data
Thanks. How about this?

DT$B = factor(DT$B)
firststep = DT[,cbind(expand.grid(B,B),v=1/length(B),C=C[1]),by="A"][Var1!=Var2]
setkey(firststep,Var1,Var2,C)
firststep = firststep[,transform(.SD,cv=cumsum(v)),by=list(Var1,Var2)]
setkey(firststep,Var1,Var2,C)
DT[, {x=data.table(expand.grid(B,B),C[1]-1L)
      firststep[x,roll=TRUE,nomatch=0][,sum(cv)]   # prior familiarity
     },by="A"]
     A  V1
[1,] 1 0.0
[2,] 2 0.5
[3,] 3 1.5
[4,] 4 0.5

On Tue, 22 Feb 2011 05:02:05 -0800, mathijsdevaan wrote:

The output for the new example should be:

project  v
1        0
2        0.5
3        1.5
4        0.5

The output you calculated was correct for the v per year, but the v per group would be incorrect. I think the problem lies in the fact that expand.grid(B,B) doesn't take into account that combinations of B can only be formed within A. Thanks again!
Re: [R] Transforming relational data
Thanks for the attempt and required output. How about this?

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2,C)
firststep = firststep[,transform(.SD,cv=cumsum(v)),by=list(Var1,Var2)]
setkey(firststep,Var1,Var2,C)
DT[, {x=data.table(expand.grid(B,B),C[1]-1L)
      firststep[x,roll=TRUE,nomatch=0][,sum(cv)]   # prior familiarity
     },by="C"]
        C  V1
[1,] 1999 0.0
[2,] 2001 0.5
[3,] 2004 2.5

I think you may have said you have large data. If so, this method should be fast. Please let us know how you get on.

HTH
Matthew

On Thu, 17 Feb 2011 23:07:19 -0800, mathijsdevaan wrote:

OK, for the last step I have tried this (among other things):

library(data.table)
DT = data.table(read.table(textConnection("A B C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001
7 3 a 2004
8 3 b 2004
9 3 d 2004"),head=TRUE,stringsAsFactors=FALSE))

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2)
list1 <- firststep[J(expand.grid(DT$B,DT$B),v=1/length(DT$B)),nomatch=0][,sum(v)]
list1
# 27

What I would like to get:

list
1  0
2  0.5
3  2.5

Thanks!
Re: [R] Transforming relational data
Mathijs,

To my eyes you seem to have repeated back what is already done. More R and less English would help. In other words, if it is not 2.5 you need, what is it? Please provide some input and state what the output should be (and what you tried already).

Matthew
Re: [R] boot.ci error with large data sets
Hello Lars, (cc'd)

Did you ask maintainer("boot") first, as requested by the posting guide? If you did, but didn't hear back, then please say so, so that we know you did follow the guide. That maintainer is particularly active, and particularly efficient, though, so I doubt you didn't hear back.

We can tell it's your first post to r-help, and we can tell you have at least read the posting guide and done very well in following almost all of it. I can't see anything else wrong with your post (and the subject line is good) ... other than where you sent it :-)

Matthew

Lars Dalby lars.da...@gmail.com wrote in message news:fef4d63e-90f6-43aa-90a6-872792faa...@s11g2000yqc.googlegroups.com...

Dear List

I have run into some problems with boot.ci from package boot. When I try to obtain a confidence interval of type "bca", boot.ci() returns the following error when the data set is large:

Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, :
  estimated adjustment 'a' is NA

Below is an example that produces the above-mentioned error on my machine.

library(boot)

# The wrapper function:
w.mean <- function(x, d) {
    E <- x[d,]
    return(weighted.mean(E$A, E$B))
}

# Some fake data:
test <- data.frame(rnorm(1000, 5), rnorm(1000, 3))
test1 <- data.frame(rnorm(1e5, 5), rnorm(1e5, 3))
names(test) <- c("A", "B")
names(test1) <- c("A", "B")

# Getting the boot object and the CI, seems to work fine
bootout <- boot(test, w.mean, R=1000, stype="i")
(bootci <- boot.ci(bootout, conf = 0.95, type = "bca"))

# Now with a bigger data set, boot.ci returns an error.
bootout1 <- boot(test1, w.mean, R=1000, stype="i")
(bootci1 <- boot.ci(bootout1, conf = 0.95, type = "bca"))

Does anyone have an idea as to why this happens? (Session info below)

Best,
Lars

sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] da_DK.UTF-8/da_DK.UTF-8/C/C/da_DK.UTF-8/da_DK.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] boot_1.2-43

loaded via a namespace (and not attached):
[1] tools_2.12.1
Re: [R] Transforming relational data
Hello. One (of many) solutions might be:

require(data.table)
DT = data.table(read.table(textConnection("A B C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001"),head=TRUE,stringsAsFactors=FALSE))

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2)
grp3 = c("a","b","d")
firststep[J(expand.grid(grp3,grp3)),nomatch=0][,sum(v)]
# 2.5

If I guess the bigger picture correctly, this can be extended to make a time series of prior familiarity by including the year in the key. If you decide to try this, please make sure to grab the latest (recent) version of data.table from CRAN (v1.5.3). Suggest that you run it first to confirm it does return 2.5, then break it down and run it step by step to see how each part works. You will need some time to read the vignettes and ?data.table (which has recently been improved) but I hope you think it is worth it. Support is available at maintainer("data.table").

HTH
Matthew

On Mon, 14 Feb 2011 09:22:12 -0800, mathijsdevaan wrote:

Hi, I have a large dataset with info on individuals (B) that have been involved in projects (A) during multiple years (C). The dataset contains three columns: A, B, C. Example:

  A B    C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001
7 3 a 2004
8 3 c 2004
9 3 d 2004

I am interested in how well all the individuals in a project know each other. To calculate this team familiarity measure I want to sum the familiarity between all individual pairs in a team. The familiarity between each individual pair in a team is calculated as the summation of each pair's prior co-appearance in a project, divided by the total number of team members. So the team familiarity in project 3 = (1/4+1/4) + (1/4+1/4+1/2) + (1/4+1/4+1/2) = 2.5, or: a has been in project 1 (of size 4) with c and d (1/4+1/4); c has been in project 1 (of size 4) with a and d (1/4+1/4); and c has been in project 2 (of size 2) with d (1/2).

I think that the best way to do it is to transform the data into an edgelist (each pair in one row/two columns) and then create two additional columns for the strength of the familiarity and the year of the project in which the pair was active. The problem is that I am stuck already at the first step. So the question is: how do I go from the current data structure to a list of projects and the familiarity of its team members? Your help is very much appreciated. Thanks!
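A direct arithmetic check of the worked example (mine):

(1/4 + 1/4) + (1/4 + 1/4 + 1/2) + (1/4 + 1/4 + 1/2)   # 2.5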
Re: [R] Convert the output of by() to a data frame
There's a much shorter way. You don't need that ugly h() with all those $ and potential for bugs!

Using the original f:

dt[,lapply(.SD,f),by=key(dt)]
 grp1 grp2 grp3        a         b         d
    x    x    x     1.00     81.00    161.00
    x    x    x    10.00     90.00    170.00
    x    x    x     5.50     85.50    165.50
    x    x    x 3.027650  3.027650  3.027650
    x    x    x 1.816590 28.239721 54.662851
    x    x    y    11.00     91.00    171.00
    x    x    y    20.00    100.00    180.00
    x    x    y    15.50     95.50    175.50
    x    x    y 3.027650  3.027650  3.027650
    x    x    y 5.119482 31.542612 57.965742
[ snip ]

To get the names included, one (long) way is:

dt[,data.table(sapply(.SD,f),keep.rownames=TRUE),by=key(dt)]
 grp1 grp2 grp3   rn        a         b         d
    x    x    x  min     1.00     81.00    161.00
    x    x    x  max    10.00     90.00    170.00
    x    x    x mean     5.50     85.50    165.50
    x    x    x   sd 3.027650  3.027650  3.027650
    x    x    x   cv 1.816590 28.239721 54.662851
    x    x    y  min    11.00     91.00    171.00
    x    x    y  max    20.00    100.00    180.00
    x    x    y mean    15.50     95.50    175.50
    x    x    y   sd 3.027650  3.027650  3.027650
    x    x    y   cv 5.119482 31.542612 57.965742
[ snip ]

However, for speed on large datasets you can drop the names in f:

f <- function(x) c(min(x), max(x), mean(x), sd(x), mean(x)/sd(x))

and put the names in afterwards:

ans = dt[,lapply(.SD,f),by=key(dt)]
ans$labels = c("min","max","mean","sd","cv")
ans
 grp1 grp2 grp3        a         b         d labels
    x    x    x     1.00     81.00    161.00    min
    x    x    x    10.00     90.00    170.00    max
    x    x    x     5.50     85.50    165.50   mean
    x    x    x 3.027650  3.027650  3.027650     sd
    x    x    x 1.816590 28.239721 54.662851     cv
    x    x    y    11.00     91.00    171.00    min
    x    x    y    20.00    100.00    180.00    max
    x    x    y    15.50     95.50    175.50   mean
    x    x    y 3.027650  3.027650  3.027650     sd
    x    x    y 5.119482 31.542612 57.965742     cv
[ snip ]

You don't want all those small pieces of memory for the names to be created over and over again every time f runs. That's only important for large datasets, though.

Matthew
Re: [R] aggregate function - na.action
Looking at the timings by each stage may help:

system.time(dt <- data.table(dat))
   user  system elapsed
   1.20    0.28    1.48

system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))   # sort by the 8 columns (one-off)
   user  system elapsed
   4.72    0.94    5.67

system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
   user  system elapsed
   2.00    0.21    2.20    # compared to 11.07s

data.table doesn't have a custom data structure, so it can't be that. data.table's structure is the same as data.frame, i.e. a list of vectors. data.table inherits from data.frame. It *is* a data.frame, too. The reasons it is faster in this example include:

1. Memory is only allocated for the largest group.
2. That memory is re-used for each group.
3. Since the data is ordered contiguously in RAM, the memory is copied over in bulk for each group using memcpy in C, which is faster than a for loop in C. Page fetches are expensive; they are minimised.

This is explained in the documentation, in particular the FAQs. This example is quite small, but the concept scales to larger sizes, i.e. the difference widens further as n increases.

http://datatable.r-forge.r-project.org/

Matthew

Hadley Wickham had...@rice.edu wrote in message news:aanlktim6drfjxqrsqlxof1ut6xr_bshqdbgpktmed...@mail.gmail.com...

There's definitely something amiss with aggregate() here, since similar functions from other packages can reproduce your 'control' sum. I expect ddply() will have some timing issues because of all the subgrouping in your data frame, but data.table did very well and the summaryBy() function in the doBy package did OK:

Well, if you use the right plyr function, it works just fine:

system.time(count(dat, c("x1", "x2", "x3", "x4", "x4", "x5", "x6", "x7", "x8"), "y"))
#   user  system elapsed
#  9.754   1.314  11.073

Which illustrates something that I've believed for a while about data.table - it's not the indexing that speeds things up, it's the custom data structure. If you use ddply with data frames, it's slow because data frames are slow. I think the right way to resolve this is to make data frames more efficient, perhaps using some kind of mutable interface where necessary for high-performance operations.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
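The columns x1..x8 belong to the thread's data set, which isn't shown here; a self-contained timing sketch in the same spirit (my construction, with made-up sizes):

library(data.table)
n <- 1e6
dat <- data.frame(g = sample(1e4, n, replace = TRUE), y = rnorm(n))
system.time(r1 <- tapply(dat$y, dat$g, sum))   # base grouping
dt <- data.table(dat)
setkey(dt, g)                                  # one-off sort
system.time(r2 <- dt[, sum(y), by = "g"])      # grouped sum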
Re: [R] using character vector as input argument to setkey (data.tablepakcage)
Hi Sean, Try :

key(test.dt) <- c("a","b")

Btw, the posting guide asks you to contact the maintainer of the package before r-help. Otherwise r-help would fill up with posts about 2000+ packages (I guess is the reason). In this case maintainer("data.table") returns datatable-h...@lists.r-forge.r-project.org (cc'd) where you will be very welcome. Matthew
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
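In current versions of data.table, setkeyv() takes a character vector directly, which is exactly the use case asked about (the toy table here is made up) :

library(data.table)
test.dt <- data.table(a = c(2,1,2), b = c("y","x","x"), v = 1:3)
cols <- c("a", "b")       # key columns held in a character variable
setkeyv(test.dt, cols)    # setkey() takes unquoted names; setkeyv() takes a character vector
key(test.dt)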
Re: [R] aggregate function - na.action
Hi Hadley, Does FAQ 1.8 answer that ok ? Ok, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package? http://datatable.r-forge.r-project.org/datatable-faq.pdf Matthew

Hadley Wickham had...@rice.edu wrote in message news:AANLkTik180p4YmBtR3QUCW7r=fdefxzbxsy3zwtik...@mail.gmail.com... On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: Looking at the timings by each stage may help :

system.time(dt <- data.table(dat))
   user  system elapsed
   1.20    0.28    1.48
system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))  # sort by the 8 columns (one-off)
   user  system elapsed
   4.72    0.94    5.67
system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
   user  system elapsed
   2.00    0.21    2.20   # compared to 11.07s

data.table doesn't have a custom data structure, so it can't be that. data.table's structure is the same as data.frame i.e. a list of vectors. data.table inherits from data.frame. It *is* a data.frame, too. The reasons it is faster in this example include : 1. Memory is only allocated for the largest group. 2. That memory is re-used for each group. 3. Since the data is ordered contiguously in RAM, the memory is copied over in bulk for each group using memcpy in C, which is faster than a for loop in C. Page fetches are expensive; they are minimised.

But this is exactly what I mean by a custom data structure - you're not using the usual data frame API. Wouldn't it be better to implement these changes to data frame so that everyone can benefit? Or is it just too specialised to this particular case (where I guess you're using that the return data structure of the summary function is consistent)? Hadley -- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] aggregate function - na.action
Hadley, That's fine; please do. I'm happy to explain it offline where the documentation or comments in the code aren't sufficient. It's GPL code so you can take it and improve it, or depend on it. Whatever works for you. As long as (of course) you don't stand on its shoulders and then restrict users' freedoms (not that I'd ever think you'd do that). One thing that did make it into R was the improvement to unique.c in R 2.12.0. Another that we hope happens one day is changing duplicate.c to use memcpy. That would automatically benefit all users anywhere R copies data (including data.frame). That wasn't our idea; that's been a FIXME in the R source for many years. See the thread on r-devel a while back (search for duplicate.c in the subject). It probably just needs someone to send a working patch file that passes checks. That's an example of something in the data.table C code that (hopefully) will make it into base R. Matthew

Hadley Wickham had...@rice.edu wrote in message news:AANLkTi=setpquiyr1+avb4-ga1-fyh9uffa6mskk+...@mail.gmail.com... Does FAQ 1.8 answer that ok ? Ok, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package? http://datatable.r-forge.r-project.org/datatable-faq.pdf Kind of. I think there are two sets of features data.table provides: * a compact syntax for expressing many common data manipulations * high performance data manipulation. FAQ 1.8 answers the question for the syntax, but not for the performance related features. Basically, I'd love to be able to use the high performance components of data.table in plyr, but keep using my existing syntax. Currently the only way to do that is for me to dig into your C code to understand why it's fast, and then implement those ideas in plyr. Hadley -- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Counting number of rows with two criteria in dataframe
Note that a key is not actually required, so it's even simpler syntax :

dX = as.data.table(X)
dX[,length(unique(z)),by="x,y"]
     x y V1
[1,] 1 1  2
[2,] 1 2  2
[3,] 2 3  2
[4,] 2 4  2
[5,] 3 5  2
[6,] 3 6  2

or passing list() syntax to the 'by' is exactly the same :

dX[,length(unique(z)),by=list(x,y)]

The advantage of using the list() form is you can group by expressions of columns, for example if x was a date column :

dX[,length(unique(z)),by=list(month(x),y)]

Matthew

Dennis Murphy djmu...@gmail.com wrote in message news:AANLkTi=8tysrrfzfm01m7fpzydh-cls-j-cmbkakj...@mail.gmail.com... Hi: Here are two more candidates, using the plyr and data.table packages:

library(plyr)
ddply(X, .(x, y), function(d) length(unique(d$z)))
  x y V1
1 1 1  2
2 1 2  2
3 2 3  2
4 2 4  2
5 3 5  2
6 3 6  2

The function counts the number of unique z values in each sub-data frame with the same x and y values. The argument d in the anonymous function is a data frame object.

# data.table version:
library(data.table)
dX <- data.table(X, key = 'x, y')
dX[, list(nz = length(unique(z))), by = 'x, y']
     x y nz
[1,] 1 1  2
[2,] 1 2  2
[3,] 2 3  2
[4,] 2 4  2
[5,] 3 5  2
[6,] 3 6  2

The key columns sort the data by x, y combinations and then find nz in each data subset. If you intend to do a lot of summarization/data manipulation in R, these packages are worth learning. HTH, Dennis

On Tue, Jan 25, 2011 at 11:25 AM, Ryan Utz utz.r...@gmail.com wrote: Hi R-users, I'm trying to find an elegant way to count the number of rows in a dataframe with a unique combination of 2 values in the dataframe. My data is specifically one column with a year, one with a month, and one with a day. I'm trying to count the number of days in each year/month combination. But for simplicity's sake, the following dataset will do:

x <- c(1,1,1,1,2,2,2,2,3,3,3,3)
y <- c(1,1,2,2,3,3,4,4,5,5,6,6)
z <- c(1,2,3,4,5,6,7,8,9,10,11,12)
X <- data.frame(x, y, z)

So with dataset X, how would I count the number of z values (3rd column in X) with unique combinations of the first two columns (x and y)? (for instance, in the above example, there are 2 instances per unique combination of the first two columns). I can do this in Matlab and it's easy, but since I'm new to R this is royally stumping me. Thanks, Ryan -- Ryan Utz Postdoctoral research scholar University of California, Santa Barbara (724) 272 7769
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
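With current data.table the same count is built in as uniqueN(); a sketch on the thread's toy data :

library(data.table)
X <- data.frame(x = rep(1:3, each = 4), y = rep(1:6, each = 2), z = 1:12)
dX <- as.data.table(X)
dX[, .(nz = uniqueN(z)), by = .(x, y)]   # .() is an alias for list()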
Re: [R] subsets
require(data.table)
DT = as.data.table(df)

# 1. Patients with ah and ihd
DT[,.SD["ah"%in%diagnosis & "ihd"%in%diagnosis],by=id]
     id diagnosis
[1,]  2        ah
[2,]  2       ihd
[3,]  2        im
[4,]  4        ah
[5,]  4       ihd
[6,]  4    angina

# 2. Patients with ah but no ihd
DT[,.SD["ah"%in%diagnosis & !"ihd"%in%diagnosis],by=id]
     id diagnosis
[1,]  1        ah
[2,]  3        ah
[3,]  3    stroke

# 3. Patients with ihd but no ah?
DT[,.SD[!"ah"%in%diagnosis & "ihd"%in%diagnosis],by=id]
     id diagnosis
[1,]  5       ihd

-- View this message in context: http://r.789695.n4.nabble.com/subsets-tp3227143p3233177.html Sent from the R help mailing list archive at Nabble.com.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
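The thread's df isn't shown; a minimal reconstruction that the calls above run against :

library(data.table)
df <- data.frame(id = c(1, 2,2,2, 3,3, 4,4,4, 5),
                 diagnosis = c("ah", "ah","ihd","im", "ah","stroke", "ah","ihd","angina", "ihd"))
DT <- as.data.table(df)
# .SD[cond] keeps the whole group when cond is TRUE and drops it when FALSE,
# because a length-1 logical recycles over the group's rows
DT[, .SD["ah" %in% diagnosis & "ihd" %in% diagnosis], by = id]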
Re: [R] Listing of available functions
Try :

objects("package:base")

Also, as it happens, a new package called unknownR is in development on R-Forge. Its description says : Do you know how many functions there are in base R? How many of them do you know you don't know? Run unk() to discover your unknown unknowns. It's fast and it's fun ! It's not ready to try yet (and may not live up to its promises) but hopefully should be ready soon. Matthew

Sébastien Bihorel pomc...@free.fr wrote in message news:aanlktinfpmthb2osgjckeo3jwsqhw+-zdyd0xtdmk...@mail.gmail.com... Dear R-users, Is there an easy way to access a complete listing of available functions from an R session? The help.start() and ? functions are great, but I feel like they require the user to know the answer in advance (especially with respect to function names)... I could not find an easy way to simply browse through a list of functions and randomly pick one function to see what it does. Is there such a possibility in R? Thanks PS: I apologize if this question appears trivial.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
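A small extension of the objects() call above, filtering the listing down to functions only (a sketch; base also exports non-function objects such as pi and LETTERS) :

objs <- objects("package:base")
isfun <- vapply(objs, function(nm) is.function(get(nm, "package:base")), logical(1))
funs <- objs[isfun]
length(funs)   # how many functions base exports
head(funs)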
Re: [R] RGL crashes
Wayland is the project to replace X11 on Linux. http://en.wikipedia.org/wiki/Wayland_(display_server) Ubuntu chiefs have said they support Wayland and aim to include it in the next release (April 2011 == version 11.04 == Natty Narwhal). Fedora developers apparently said that they are likely to adopt Wayland too. I don't know if packages in R such as rgl would need changing to work with Wayland, or perhaps R itself, if at all. However it seems that Linux is moving away from X11. Mentioned it here because the issue in this thread appears to be X11 specific. X11's days seem to be numbered if I understand correctly. Matthew

Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4cffca13.7070...@gmail.com... Matthew Dowle wrote: Might Wayland fix it in Narwhal ? I hope those names mean something to Rainer, because they mean nothing to me. Duncan Murdoch [ snip ]
Re: [R] RGL crashes
Might Wayland fix it in Narwhal ?

Duncan Murdoch murdoch.dun...@gmail.com wrote in message news:4cff7177.7030...@gmail.com... On 08/12/2010 6:07 AM, Rainer M Krug wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 12/08/2010 12:05 PM, Duncan Murdoch wrote: Rainer M Krug wrote: Hi, rgl crashes my R session when resizing the rgl graphic window. I am using Ubuntu Maverick, with a dual monitor setup. If I disconnect one monitor, I can resize it a little bit, but it still crashes if I enlarge it too much. I assume that the problem has to do with allocated graphic memory in the kernel, but why is R crashing completely, and not even giving the usual crash options? Cheers, Rainer

sessionInfo()
R version 2.12.0 (2010-10-15) Platform: i686-pc-linux-gnu (32-bit)
locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] rgl_0.92.794
version _ platform i686-pc-linux-gnu arch i686 os linux-gnu system i686, linux-gnu status major 2 minor 12.0 year 2010 month 10 day 15 svn rev 53317 language R version.string R version 2.12.0 (2010-10-15)

After executing library(rgl) example(rgl) and resizing the graph window, R crashes with the following message: drmRadeonCmdBuffer: -22. Kernel failed to parse or rejected command stream. See dmesg for more info. from dmesg: [ 7349.471959] [drm:r100_cs_track_check] *ERROR* [drm] Buffer too small for color buffer 0 (need 413696 have 262144) ! [ 7349.471964] [drm:r100_cs_track_check] *ERROR* [drm] color buffer 0 (256 4 0 404) [ 7349.471967] [drm:radeon_cs_ioctl] *ERROR* Invalid command stream !

Those messages look like they're coming from your graphics driver, not from R. So rgl may be doing something it shouldn't do, but you'll probably have to diagnose what that is. It's unlikely to be reproducible on another system.

That's what I fear as well - could you give me any tips on how to proceed to identify the problem?

It might help to know which line of code in rgl actually triggered the error, but debugging X11 code is tricky. The function that likely triggered the problem is X11WindowImpl::setWindowRect in rgl/src/x11gui.cpp; it makes calls to X11 functions that do the actual work. Duncan Murdoch

Rainer -- Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany) Centre of Excellence for Invasion Biology Natural Sciences Building Office Suite 2039 Stellenbosch University Main Campus, Merriman Avenue Stellenbosch South Africa Tel: +33 - (0)9 53 10 27 44 Cell: +27 - (0)8 39 47 90 42 Fax (SA): +27 - (0)8 65 16 27 82 Fax (D): +49 - (0)3 21 21 25 22 44 Fax (FR): +33 - (0)9 58 10 27 44 email: rai...@krugs.de Skype: RMkrug
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] fast subsetting of lists in lists
Hello Alex, Assuming it was just an inadequate example (since a data.frame would suffice in that case), did you know that a data.frame's columns do not have to be vectors but can be lists? I don't know if that helps.

DF = data.frame(a=1:3)
DF$b = list(pi, 2:3, letters[1:5])
DF
  a             b
1 1      3.141593
2 2          2, 3
3 3 a, b, c, d, e
DF$b
[[1]] [1] 3.141593
[[2]] [1] 2 3
[[3]] [1] "a" "b" "c" "d" "e"
sapply(DF,class)
        a         b
"integer"    "list"

That is still regular though in the sense that each row has a value for all the columns, even if that value is NA, or NULL in lists. If your data is not regular then one option is to flatten it into a (row,column,value) tuple, similar to how sparse matrices are stored. Your value column may be list rather than vector. Then (and yes you guessed this was coming) ... you can use data.table to query the flat structure quickly by setting a key on the first two columns, or maybe just the 2nd column when you need to pick out the values for one 'column' quickly for all 'rows'. There was a thread about using list() columns in data.table here : http://r.789695.n4.nabble.com/Suggest-a-cool-feature-Use-data-table-like-a-sorted-indexed-data-list-tp2544213p2544213.html

Does someone know a trick to do the same as above with the faster built-in subsetting? Something like: test[somesubsettingmagic]

So in data.table if you wanted all the 'a' values, you might do something like this :

setkey(DT,column)
DT[J("a"), value]

which should return the list() quickly from the irregular data. Matthew

Alexander Senger sen...@physik.hu-berlin.de wrote in message news:4cfe6aee.6030...@physik.hu-berlin.de... Hello Gerrit, Gabor, thank you for your suggestion. Unfortunately unlist seems to be rather expensive. A short test with one of my datasets gives 0.01s for an extraction based on my approach and 5.6s for unlist alone. The reason seems to be that unlist relies on lapply internally and does so recursively? Maybe there is still another way to go? Alex

On 07.12.2010 15:59, Gerrit Eichner wrote: Hello, Alexander, does

utest <- unlist(test)
utest[ names(utest) == "a" ]

come close to what you need? Hth, Gerrit

On Tue, 7 Dec 2010, Alexander Senger wrote: Hello, my data is contained in nested lists (which seems not necessarily to be the best approach). What I need is a fast way to get subsets from the data. An example:

test <- list(list(a = 1, b = 2, c = 3), list(a = 4, b = 5, c = 6), list(a = 7, b = 8, c = 9))

Now I would like to have all values in the named variables a, that is the vector c(1, 4, 7). The best I could come up with is:

val <- sapply(1:3, function (i) {test[[i]]$a})

which is unfortunately not very fast. According to R-inferno this is due to the fact that apply and its derivates do looping in R rather than rely on C-subroutines as the common [-operator. Does someone know a trick to do the same as above with the faster built-in subsetting? Something like: test[somesubsettingmagic] Thank you for your advice Alex
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
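A sketch of the flat (row,column,value) idea on the thread's toy list, with a key on column (the DT built here is made up; the original wasn't shown) :

library(data.table)
test <- list(list(a = 1, b = 2, c = 3),
             list(a = 4, b = 5, c = 6),
             list(a = 7, b = 8, c = 9))
DT <- data.table(row    = rep(seq_along(test), lengths(test)),
                 column = unlist(lapply(test, names)),
                 value  = unlist(test, use.names = FALSE))
setkey(DT, column)
DT[J("a"), value]   # 1 4 7, found by binary search rather than a vector scan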
Re: [R] Performance tuning tips when working with wide datasets
Richard, Try data.table. See the introduction vignette and the presentations e.g. there is a slide showing a join to 183,000,000 observations of daily stock prices in 0.002 seconds. data.table has fast rolling joins (i.e. fast last observation carried forward) too. I see you asked about that on this list on 8 Nov. Also see fast aggregations using 'by' on a key()-ed in-memory table. I wonder if your 20,000 columns are always populated for all rows. If not then consider collapsing to a 3 column table (row,col,data) and then joining to that. You may have that format in your original data source anyway, so you may be able to skip a step you may have implemented already which expands that format to wide. In other words, keeping it narrow may be an option (like how a sparse matrix is stored). Matthew http://datatable.r-forge.r-project.org/

Richard Vlasimsky richard.vlasim...@imidex.com wrote in message news:2e042129-4430-4c66-9308-a36b761eb...@imidex.com... Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)? In particular, I am trying to perform a merge like below:

merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);

This statement takes about 8 hours to execute on a pretty fast machine. The dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25 columns. The dataset data2 contains annual data (only 60 observations), however there are lots of columns (20,000 of them). I have to do a lot of these kinds of merges so need to figure out a way to speed it up. I have tried a number of different things to speed things up to no avail. I've noticed that rbinds execute much faster using matrices than dataframes. However the performance improvement when using matrices (vs. data frames) on merges was negligible (8 hours down to 7). I tried casting my merge field (date) into various different data types (character, factor, date). This didn't seem to have any effect. I tried the hash package, however, merge couldn't coerce the class into a data.frame. I've tried various ways to parallelize computation in the past, and found that to be problematic for a variety of reasons (runaway forked processes, doesn't run in a GUI environment, doesn't run on Macs, etc.). I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't take that long. Thanks, Richard
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
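A sketch of the keyed join plus rolling join being suggested, at the sizes described (synthetic data; the column names are made up) :

library(data.table)
d1 <- data.table(date = as.IDate("1950-01-01") + 0:19999, x = rnorm(20000), key = "date")    # daily
d2 <- data.table(date = as.IDate("1950-01-01") + 365L*(0:59), y = rnorm(60), key = "date")   # annual
system.time(ans <- d2[d1])                 # keyed merge: one row per daily date, y where the dates match
system.time(ans <- d2[d1, roll = TRUE])    # rolling join: each annual y carried forward over the daily rows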
Re: [R] Finding the nearest data in intraday data from two zoo objects
Try data.table with the roll=TRUE argument. Set your keys and then write :

futData[optData,roll=TRUE]

That is fast and, as you can see, short. Works on many millions and even billions of rows in R. Matthew http://datatable.r-forge.r-project.org/

Santosh Srinivas santosh.srini...@gmail.com wrote in message news:4ced3783.2af98e0a.57f0.b...@mx.google.com... Hello Group, I have the following options and future data in zoo objects

head(optData.z)
                    ExpDt    OptTyp Strike TrdPrice  TotTrdQty
2009-01-01 09:55:03 20090129      1   2900 180.50
2009-01-01 09:55:31 20090129      1   2900 188.50
2009-01-01 09:55:37 20090129      1   2900 185.            500
2009-01-01 09:55:39 20090129      1   2900 185.            500
2009-01-01 09:55:47 20090129      1   2900 185.1125        600
2009-01-01 09:55:48 20090129      1   2900 185.2500         50

head(futData.z)
                    ExpDt    OptTyp Strike TrdPrice  TotTrdQty
2009-01-01 09:55:09 20090129      2      0 2979.000        900
2009-01-01 09:55:11 20090129      2      0 2976.633        600
2009-01-01 09:55:12 20090129      2      0 2977.211        900
2009-01-01 09:55:14 20090129      2      0 2977.750        800
2009-01-01 09:55:15 20090129      2      0 2977.019       4300
2009-01-01 09:55:16 20090129      2      0 2977.050        800

I want to get the closest available futures price for every option ... Is there any function like Excel's approximate VLOOKUP using date time? Thank you.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
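A sketch of the rolling join on made-up timestamps (real data would use the zoo index converted to POSIXct) :

library(data.table)
base <- as.POSIXct("2009-01-01 09:55:00")
futData <- data.table(time = base + c(9, 11, 14), futPrice = c(2979.000, 2976.633, 2977.750))
optData <- data.table(time = base + c(3, 31, 37), optPrice = c(180.50, 188.50, 185.00))
setkey(futData, time)
setkey(optData, time)
futData[optData, roll = TRUE]   # for each option trade, the last futures price at or before it (NA if none yet)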
Re: [R] Sorting and subsetting
All the solutions in this thread so far use the lapply(split(...)) paradigm either directly or indirectly. That paradigm doesn't scale. That's the likely source of quite a few 'out of memory' errors and performance issues in R. data.table doesn't do that internally, and its syntax is pretty easy.

tmp <- data.table(index = gl(2,20), foo = rnorm(40))
tmp[, .SD[head(order(-foo),5)], by=index]
      index index.1       foo
 [1,]     1       1 1.9677303
 [2,]     1       1 1.2731872
 [3,]     1       1 1.1100931
 [4,]     1       1 0.8194719
 [5,]     1       1 0.6674880
 [6,]     2       2 1.2236383
 [7,]     2       2 0.9606766
 [8,]     2       2 0.8654497
 [9,]     2       2 0.5404112
[10,]     2       2 0.3373457

As you can see it currently repeats the group column which is a shame (on the to do list to fix). Matthew http://datatable.r-forge.r-project.org/
-- View this message in context: http://r.789695.n4.nabble.com/Sorting-and-subsetting-tp2547360p2548319.html Sent from the R help mailing list archive at Nabble.com.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
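In current data.table versions the repeated group column is gone and the top-5-per-group idiom is shorter: subset in i first, then take the head of each group :

library(data.table)
tmp <- data.table(index = gl(2, 20), foo = rnorm(40))
tmp[order(-foo), head(.SD, 5), by = index]   # top 5 foo within each index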
Re: [R] Sorting and subsetting
Probably true, that's cunning, but look at base::match. The first thing it does is coerce factor to character (an allocation and copy internally). data.table doesn't do that either, see data.table:::sortedmatch. I've made the first basic steps towards a proper reproducible test suite (timings.Rnw). Perhaps this example could be added there; the PDF is on the homepage. One test is 340 times faster and the other is 13 times faster. More examples would be good. Matthew http://datatable.r-forge.r-project.org/

Joshua Wiley jwiley.ps...@gmail.com wrote in message news:aanlktimyuvl9suj65ktzqvpnyn+ep8ubu3mxxhhrd...@mail.gmail.com... On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: All the solutions in this thread so far use the lapply(split(...)) paradigm either directly or indirectly. That paradigm doesn't scale. That's the likely source of quite a few 'out of memory' errors and performance issues in R. This is a good point. It is not nearly as straightforward as the syntax for data.table (which seems to order and select in one step...very nice!), but this should be less memory intensive:

tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, tmp$foo) , ]
# find location of first instance of each level and add 0:4 to it
x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)
tmp[x, ]

data.table doesn't do that internally, and its syntax is pretty easy.

tmp <- data.table(index = gl(2,20), foo = rnorm(40))
tmp[, .SD[head(order(-foo),5)], by=index]
      index index.1       foo
 [1,]     1       1 1.9677303
 [2,]     1       1 1.2731872
 [3,]     1       1 1.1100931
 [4,]     1       1 0.8194719
 [5,]     1       1 0.6674880
 [6,]     2       2 1.2236383
 [7,]     2       2 0.9606766
 [8,]     2       2 0.8654497
 [9,]     2       2 0.5404112
[10,]     2       2 0.3373457

As you can see it currently repeats the group column which is a shame (on the to do list to fix). Matthew http://datatable.r-forge.r-project.org/
-- View this message in context: http://r.789695.n4.nabble.com/Sorting-and-subsetting-tp2547360p2548319.html Sent from the R help mailing list archive at Nabble.com.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Joshua Wiley Ph.D. Student, Health Psychology University of California, Los Angeles http://www.joshuawiley.com/
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Sorting and subsetting
See data.table:::duplist which does that (or at least very similar) in C, for multiple columns too. Matthew http://datatable.r-forge.r-project.org/

peter dalgaard pda...@gmail.com wrote in message news:660991c3-b52b-4d58-b819-eadc95ecc...@gmail.com... On Sep 21, 2010, at 16:27 , Joshua Wiley wrote: On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: All the solutions in this thread so far use the lapply(split(...)) paradigm either directly or indirectly. That paradigm doesn't scale. That's the likely source of quite a few 'out of memory' errors and performance issues in R. This is a good point. It is not nearly as straightforward as the syntax for data.table (which seems to order and select in one step...very nice!), but this should be less memory intensive:

tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, tmp$foo) , ]
# find location of first instance of each level and add 0:4 to it
x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)
tmp[x, ]

That will get you in trouble if any group has size less than 5, though. Something involving duplicated() could work; you just need to generate the sawtooth sequence: 0,1,2,3,4,0,1,2,3,4,5,6,0,1,2,... and select values less than or equal to 4. I _think_ this should work (it does on the airquality dataframe, anyway):

ix <- tmp$index
s <- seq_along(ix)
j <- diff(s[!duplicated(ix)])
s2 <- rep.int(0, length(s))
s2[!duplicated(ix)] <- c(1,j)
d <- s - cumsum(s2)
tmp[d < 5,]

Or, another version of the same idea, giving teeth starting at 1 instead

d <- s - c(0,cumsum(table(ix)))[factor(ix)]
tmp[d <= 5, ]

(There are times when I contemplate writing a DATAstep() function; this is one of those things that are straightforward in the SAS sequential processing paradigm. Of course there are things that are much more complicated in SAS, too.)

data.table doesn't do that internally, and its syntax is pretty easy.

tmp <- data.table(index = gl(2,20), foo = rnorm(40))
tmp[, .SD[head(order(-foo),5)], by=index]
      index index.1       foo
 [1,]     1       1 1.9677303
 [2,]     1       1 1.2731872
 [3,]     1       1 1.1100931
 [4,]     1       1 0.8194719
 [5,]     1       1 0.6674880
 [6,]     2       2 1.2236383
 [7,]     2       2 0.9606766
 [8,]     2       2 0.8654497
 [9,]     2       2 0.5404112
[10,]     2       2 0.3373457

As you can see it currently repeats the group column which is a shame (on the to do list to fix). Matthew http://datatable.r-forge.r-project.org/
-- View this message in context: http://r.789695.n4.nabble.com/Sorting-and-subsetting-tp2547360p2548319.html Sent from the R help mailing list archive at Nabble.com.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Joshua Wiley Ph.D. Student, Health Psychology University of California, Los Angeles http://www.joshuawiley.com/
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Peter Dalgaard Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
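A runnable restatement of the second sawtooth variant, assuming rows are pre-sorted so the top values come first within each group :

tmp <- data.frame(index = gl(2, 20), foo = rnorm(40))
tmp <- tmp[order(tmp$index, -tmp$foo), ]                        # sort descending within group
s <- seq_along(tmp$index)
d <- s - c(0, cumsum(table(tmp$index)))[as.integer(tmp$index)]  # within-group row position 1,2,...
tmp[d <= 5, ]                                                   # top 5 rows per group, no split() needed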
Re: [R] Pass By Value Questions
To: r-help Cc: Jeff, Matt, Duncan, Hadley [ using Nabble to cc ] Jeff, Matt, How about the 'refdata' class in package ref. Also, Hadley's immutable data.frame in plyr 1.1. Both allow you to refer to subsets of a data.frame or matrix by reference I believe, if I understand correctly. Matthew http://datatable.r-forge.r-project.org/ -- View this message in context: http://r.789695.n4.nabble.com/Pass-By-Value-Questions-tp2331565p2332330.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coef(summary) and plyr
Another option for consideration :

library(data.table)
mydt = as.data.table(mydf)
mydt[,as.list(coef(lm(y~x1+x2+x3))),by=fac]
     fac X.Intercept.       x1       x2        x3
[1,]   0  -0.16247059 1.130220 2.988769 -19.14719
[2,]   1   0.08224509 1.216673 2.847960 -19.16105
[3,]   2   0.02052320 1.135421 3.134154 -19.22555

mydt[,data.table(coef(summary(lm(y~x1+x2+x3))),keep.rownames=TRUE), by=fac]
      fac          rn     Estimate Std..Error      t.value     Pr...t..
 [1,]   0 (Intercept)  -0.16247059  0.1521507   -1.0678269 2.929087e-01
 [2,]   0          x1   1.13021985  0.1374020    8.2256414 1.079035e-09
 [3,]   0          x2   2.98876920  0.1404903   21.2738533 1.325909e-21
 [4,]   0          x3 -19.14719151  0.1335139 -143.4096890 4.520371e-50
 [5,]   1 (Intercept)   0.08224509  0.2360664    0.3483981 7.313719e-01
 [6,]   1          x1   1.21667349  0.2723201    4.4678058 2.637743e-04
 [7,]   1          x2   2.84796003  0.2232960   12.7541904 9.192555e-11
 [8,]   1          x3 -19.16104669  0.2394431  -80.0233818 1.707058e-25
 [9,]   2 (Intercept)   0.02052320  0.1902526    0.1078734 9.147302e-01
[10,]   2          x1   1.13542085  0.1786333    6.3561559 2.980475e-07
[11,]   2          x2   3.13415398  0.1894404   16.5442781 7.827178e-18
[12,]   2          x3 -19.22554984  0.1708307 -112.5415605 2.536686e-45

http://datatable.r-forge.r-project.org/ Matthew
-- View this message in context: http://r.789695.n4.nabble.com/coef-summary-and-plyr-tp2318460p2319068.html Sent from the R help mailing list archive at Nabble.com.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
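The thread's mydf isn't shown; a hypothetical construction that the two calls above run on (the printed coefficients will of course differ from the thread's) :

set.seed(1)
n <- 300
mydf <- data.frame(fac = factor(rep(0:2, length.out = n)),
                   x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
mydf$y <- with(mydf, x1 + 3*x2 - 19*x3 + rnorm(n))   # slopes chosen to resemble the fits shown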
Re: [R] Finding points where two timeseries cross over
Is this what you mean?

x=c(1,2,2,3,4,5,6,3,2,1)
y=c(2,3,4,2,1,2,3,4,5,6)
matplot(cbind(x,y),type="l")
which(diff(sign(x-y))!=0)+1
[1] 4 8

-- View this message in context: http://r.789695.n4.nabble.com/Finding-points-where-two-timeseries-cross-over-tp2313257p2313510.html Sent from the R help mailing list archive at Nabble.com.
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] long to wide on larger data set
Juliet, I've been corrected off list. I did not read properly that you are on 64bit. The calculation should be : 53860858 * 4 * 8 / 1024^3 = 1.6GB since pointers are 8 bytes on 64bit. Also, data.table is an add-on package so I should have included :

install.packages("data.table")
require(data.table)

data.table is available on all platforms both 32bit and 64bit. Please forgive mistakes: 'someoone' should be 'someone', 'percieved' should be 'perceived' and 'testDate' should be 'testData' at the end. The rest still applies, and you might have a much easier time than I thought since you are on 64bit. I was working on the basis of squeezing into 32bit. Matthew

Matthew Dowle mdo...@mdowle.plus.com wrote in message news:i1faj2$lv...@dough.gmane.org... Hi Juliet, Thanks for the info. It is very slow because of the == in

testData[testData$V2==one_ind,]

Why? Imagine someoone looks for 10 people in the phone directory. Would they search the entire phone directory for the first person's phone number, starting on page 1, looking at every single name, even continuing to the end of the book after they had found them ? Then would they start again from page 1 for the 2nd person, and then the 3rd, searching the entire phone directory from start to finish for each and every person ? That code using == does that. Some of us call that a 'vector scan' and it's a common reason for R being percieved as slow. To do that more efficiently try this :

testData = as.data.table(testData)
setkey(testData,V2)   # sorts data by V2
for (one_ind in mysamples) {
  one_sample <- testData[one_ind,]
  reshape(one_sample)
}

or just this :

testData = as.data.table(testData)
setkey(testDate,V2)
testData[,reshape(.SD,...), by=V2]

That should solve the vector scanning problem, and get you on to the memory problems which will need to be tackled. Since the 4 columns are character, then the object size should be roughly : 53860858 * 4 * 4 / 1024^3 = 0.8GB. That is more promising to work with in 32bit so there is hope. [ That 0.8GB ignores the (likely small) size of the unique strings in the global string hash (depending on your data). ] It's likely that the as.data.table() fails with out of memory. That is not data.table but unique. There is a change in unique.c in R 2.12 which makes unique more efficient and since factor calls unique, it may be necessary to use R 2.12. If that still doesn't work, then there are several more tricks (and we will need further information), and there may be some tweaks needed to that code as I didn't test it, but I think it should be possible in 32bit using R 2.12. Is it an option to just keep it in long format and use a data.table ?

testDate[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]

Why do you need to reshape from long to wide ? HTH, Matthew

Juliet Hannah juliet.han...@gmail.com wrote in message news:aanlktinyvgmrvdp0svc-fylgogn2ro0omnugqbxx_...@mail.gmail.com... Hi Jim, Thanks for responding. Here is the info I should have included before. I should be able to access 4 GB.

str(myData)
'data.frame': 53860857 obs. of 4 variables:
$ V1: chr "23" "26" "200047" "200050" ...
$ V2: chr "cv0001" "cv0001" "cv0001" "cv0001" ...
$ V3: chr "A" "A" "A" "B" ...
$ V4: chr "B" "B" "A" "B" ...
sessionInfo()
R version 2.11.0 (2010-04-22) x86_64-unknown-linux-gnu
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base

On Mon, Jul 12, 2010 at 7:54 AM, jim holtman jholt...@gmail.com wrote: What is the configuration you are running on (OS, memory, etc.)? What does your object consist of? Is it numeric, factors, etc.? Provide a 'str' of it. If it is numeric, then the size of the object is probably about 1.8GB. Doing the long to wide you will probably need at least that much additional memory to hold the copy, if not more. This would be impossible on a 32-bit version of R. On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah juliet.han...@gmail.com wrote: I have a data set that has 4 columns and 53860858 rows. I was able to read this into R with:

cc <- rep("character", 4)
myData <- read.table("myData.csv", header=FALSE, skip=1, colClasses=cc, nrow=53860858, sep=",")

I need to reshape this data from long to wide. On a small data set the following lines work. But on the real data set, it didn't finish even when I took a sample of two (rows in new data). I didn't receive an error. I just stopped it because it was taking too long. Any suggestions for improvements? Thanks.

# start example
# i have commented out the write.table statement below
testData <- read.table
Re: [R] long to wide on larger data set
Hi Juliet, Thanks for the info. It is very slow because of the == in

testData[testData$V2==one_ind,]

Why? Imagine someoone looks for 10 people in the phone directory. Would they search the entire phone directory for the first person's phone number, starting on page 1, looking at every single name, even continuing to the end of the book after they had found them ? Then would they start again from page 1 for the 2nd person, and then the 3rd, searching the entire phone directory from start to finish for each and every person ? That code using == does that. Some of us call that a 'vector scan' and it's a common reason for R being percieved as slow. To do that more efficiently try this :

testData = as.data.table(testData)
setkey(testData,V2)   # sorts data by V2
for (one_ind in mysamples) {
  one_sample <- testData[one_ind,]
  reshape(one_sample)
}

or just this :

testData = as.data.table(testData)
setkey(testDate,V2)
testData[,reshape(.SD,...), by=V2]

That should solve the vector scanning problem, and get you on to the memory problems which will need to be tackled. Since the 4 columns are character, then the object size should be roughly : 53860858 * 4 * 4 / 1024^3 = 0.8GB. That is more promising to work with in 32bit so there is hope. [ That 0.8GB ignores the (likely small) size of the unique strings in the global string hash (depending on your data). ] It's likely that the as.data.table() fails with out of memory. That is not data.table but unique. There is a change in unique.c in R 2.12 which makes unique more efficient and since factor calls unique, it may be necessary to use R 2.12. If that still doesn't work, then there are several more tricks (and we will need further information), and there may be some tweaks needed to that code as I didn't test it, but I think it should be possible in 32bit using R 2.12. Is it an option to just keep it in long format and use a data.table ?

testDate[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]

Why do you need to reshape from long to wide ? HTH, Matthew

Juliet Hannah juliet.han...@gmail.com wrote in message news:aanlktinyvgmrvdp0svc-fylgogn2ro0omnugqbxx_...@mail.gmail.com... Hi Jim, Thanks for responding. Here is the info I should have included before. I should be able to access 4 GB.

str(myData)
'data.frame': 53860857 obs. of 4 variables:
$ V1: chr "23" "26" "200047" "200050" ...
$ V2: chr "cv0001" "cv0001" "cv0001" "cv0001" ...
$ V3: chr "A" "A" "A" "B" ...
$ V4: chr "B" "B" "A" "B" ...

sessionInfo()
R version 2.11.0 (2010-04-22) x86_64-unknown-linux-gnu
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base

On Mon, Jul 12, 2010 at 7:54 AM, jim holtman jholt...@gmail.com wrote: What is the configuration you are running on (OS, memory, etc.)? What does your object consist of? Is it numeric, factors, etc.? Provide a 'str' of it. If it is numeric, then the size of the object is probably about 1.8GB. Doing the long to wide you will probably need at least that much additional memory to hold the copy, if not more. This would be impossible on a 32-bit version of R. On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah juliet.han...@gmail.com wrote: I have a data set that has 4 columns and 53860858 rows.
I was able to read this into R with:

cc <- rep("character", 4)
myData <- read.table("myData.csv", header=FALSE, skip=1, colClasses=cc, nrow=53860858, sep=",")

I need to reshape this data from long to wide. On a small data set the following lines work. But on the real data set, it didn't finish even when I took a sample of two (rows in new data). I didn't receive an error. I just stopped it because it was taking too long. Any suggestions for improvements? Thanks.

# start example
# i have commented out the write.table statement below
testData <- read.table(textConnection("rs853,cv0084,A,A
rs86,cv0084,C,B
rs883,cv0084,E,F
rs853,cv0085,G,H
rs86,cv0085,I,J
rs883,cv0085,K,L"), header=FALSE, sep=",")
closeAllConnections()
mysamples <- unique(testData$V2)
for (one_ind in mysamples) {
  one_sample <- testData[testData$V2==one_ind,]
  mywide <- reshape(one_sample, timevar = "V1", idvar = "V2", direction = "wide")
  # write.table(mywide, file = "newdata.txt", append=TRUE, row.names=FALSE, col.names=FALSE, quote=FALSE)
}

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying
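A sketch of the keyed alternative suggested above, run on the thread's toy data (keeping the original column names) :

library(data.table)
testData <- data.table(V1 = rep(c("rs853","rs86","rs883"), 2),
                       V2 = rep(c("cv0084","cv0085"), each = 3),
                       V3 = c("A","C","E","G","I","K"),
                       V4 = c("A","B","F","H","J","L"))
setkey(testData, V2)    # sort once
testData["cv0085"]      # binary search on the key, not a vector scan over all rows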
Re: [R] Query about using timestamps returned by SQL as 'factor' for split
Hi Ted, Well since you mentioned data.table (!) ... If risk_input is a data.table consisting of 3 columns (m_id, sale_date, return_date) where the dates are of class IDate (recently added to data.table by Tom) then try :

risk_input[, fitdistr(return_date-sale_date,"normal"), by=list(m_id, year(sale_date), week(sale_date))]

Notice that the 'by' can contain expressions of columns, and lets you group by more than one expression. You don't have to repeat the 'group by' expressions in the select, as you would do in SQL. data.table returns those group columns automatically in the result, alongside the result of the j expression applied to each group. If you need to aggregate by m_id, year and month rather than week another way is :

risk_input[, fitdistr(return_date-sale_date,"normal"), by=list(m_id, round(sale_date,"month"))]

plyr and sqldf can do this task too by the way, and I'd highly recommend you take a look at those packages. There are also many excellent datetime classes around which you could also consider. The reason we need IDate in data.table is because data.table uses radix sorting, see ?sort.list. That is ultra fast for integers. Again, radix is something Tom added to data.table. The radix algorithm (see wikipedia) is specifically designed to sort integers only. We would use Date, but that is stored as numeric. IDate is the same as Date but stored as integer. HTH, Matthew

Ted Byers r.ted.by...@gmail.com wrote in message news:aanlktinchf3tfzkndcwolrwsxekgpfpjes3f8m5tq...@mail.gmail.com... I have a simple query as follows:

SELECT m_id, sale_date, YEAR(sale_date), WEEK(sale_date), return_type, DATEDIFF(return_date, sale_date) AS elapsed_time FROM risk_input

I can get, and view, all the data that that query returns. The question is, sale_date is a timestamp, and I need to call split to group this data by m_id and the week in which the sale occurred. Obviously, I would normally need both YEAR and WEEK so that data from April this year is not combined with that from last year (the system is non-autonomous). And then I need to use lapply to apply fitdistr to each subsample. Obviously, I can handle all this data in either a data.frame or in a data.table. There are two aspects of the question. 1) Is there a function (or package) that will let me group (or regroup) time series data into the week in which the data apply, properly taking into account the year that applies, in a single call passing sale_date as the argument? If I can, then I can reduce the amount of data I draw from my MySQL server and the computational load it bears. 2) The example provided for split splits only according to a single variable (g <- airquality$Month; l <- split(airquality, g)). How would that example be changed if there were two or more columns in the data.frame that are needed to define the groups? I.e. in my example, I'd need to group by m_id, and the year and week values that can be computed from sale_date. Thanks Ted
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
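A runnable sketch of the grouped aggregation above (made-up data; mean() stands in for MASS::fitdistr() to stay self-contained) :

library(data.table)
set.seed(1)
risk_input <- data.table(m_id = sample(1:3, 100, TRUE),
                         sale_date = as.IDate("2010-01-01") + sample(0:180, 100, TRUE))
risk_input[, return_date := sale_date + sample(7:60, 100, TRUE)]
risk_input[, .(mean_days = mean(as.numeric(return_date - sale_date))),
           by = .(m_id, year(sale_date), week(sale_date))]   # year() and week() are provided by data.table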
Re: [R] Performance enhancement for ave
dt = data.table(d, key="grp1,grp2")
system.time(ans1 <- dt[ , list(mean(x),mean(y)) , by=list(grp1,grp2)])
   user  system elapsed
   3.89    0.00    3.91   # your 7.064 is 12.23 for me though, so this 3.9 should be faster for you

However, Rprof() shows that 3.9 is mostly dispatch of mean to mean.default which then calls .Internal. Because there are so many groups here, dispatch bites. So ...

system.time(ans2 <- dt[ , list(.Internal(mean(x)),.Internal(mean(y))), by=list(grp1,grp2)])
   user  system elapsed
   0.20    0.00    0.21
identical(ans1,ans2)
[1] TRUE

Hadley Wickham had...@rice.edu wrote in message news:aanlktilh_-3_cycf_fnqmhh6w2og5jj5u0yopx_qa...@mail.gmail.com...

library(plyr)
n <- 10^5
grp1 <- sample(1:750, n, replace=T)
grp2 <- sample(1:750, n, replace=T)
d <- data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)
system.time({
  d$avx1 <- ave(d$x, list(d$grp1, d$grp2))
  d$avy1 <- ave(d$y, list(d$grp1, d$grp2))
})
#   user  system elapsed
# 39.300   0.279  40.809
system.time({
  d$avx2 <- ave(d$x, interaction(d$grp1, d$grp2, drop = T))
  d$avy2 <- ave(d$y, interaction(d$grp1, d$grp2, drop = T))
})
#   user  system elapsed
#  6.735   0.209   7.064
all.equal(d$avy1, d$avy2)
# TRUE
all.equal(d$avx1, d$avx2)
# TRUE

i.e. ave should use g <- interaction(..., drop = TRUE) Hadley -- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
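In current data.table versions this dispatch overhead is handled internally: for simple j expressions like mean(x), the grouping engine substitutes an optimised internal mean (GForce), so the plain syntax is fast and .Internal() is unnecessary (and disallowed in packages anyway). A sketch :

library(data.table)
n <- 10^5
d <- data.table(x = rnorm(n), y = rnorm(n),
                grp1 = sample(1:750, n, TRUE), grp2 = sample(1:750, n, TRUE))
system.time(d[, .(avx = mean(x), avy = mean(y)), by = .(grp1, grp2)])
# options(datatable.verbose=TRUE) reports when GForce optimises j, confirming no per-group dispatch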
Re: [R] lapply or data.table to find a unit's previous transaction
William, Try a rolling join in data.table, something like this (untested) :

setkey(Data, UnitID, TranDt)              # sort by unit then date
previous = transform(Data, TranDt=TranDt-1)
Data[previous,roll=TRUE]                  # lookup the prevailing date before, if any, for each row within that row's UnitID

That's all it is, no loops required. That should be fast and memory efficient. Hundreds of times faster than a subquery in SQL. If you have trouble please follow up on datatable-help. Matthew

William Rogers whroger...@gmail.com wrote in message news:aanlktikk_avupm7j108iseryo9fucpnjhanxpaqvt...@mail.gmail.com... I have a dataset of property transactions that includes the transaction ID (TranID), property ID (UnitID), and transaction date (TranDt). I need to create a data frame (or data table) that includes the previous transaction date, if one exists. This is an easy problem in SQL, where I just run a sub-query, but I'm trying to make R my one-stop-shopping program. The following code works on a subset of my data, but I can't run this on my full dataset because my computer runs out of memory after about 30 minutes. (Using a 32-bit machine.) Use the following synthetic data for example.

n <- 100
TranID <- lapply(n:(2*n), function(x) (
  as.matrix(paste(x, sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), "days"), sample(1:5, 1)), sep= "D"), ncol= 1)))
TranID <- do.call(rbind, TranID)
UnitID <- substr(TranID, 1, nchar(n))
TranDt <- substr(TranID, nchar(n)+2, nchar(n)+11)
Data <- data.frame(TranID= TranID, UnitID= UnitID, TranDt= as.Date(TranDt))

#First I create a list of all the previous transactions by unit
TranList <- as.matrix(Data$TranID, ncol= 1)
PreTran <- lapply(TranList, function(x) (with(Data, Data[ UnitID== substr(x, 1, nchar(n)) & TranDt < Data[TranID== x, TranDt], ] )) )
#I do get warnings about missing data because some transactions have no predecessor.
#Some transactions have no previous transactions, others have many so I pick the most recent
BeforeTran <- lapply(seq_along(PreTran), function(x) ( with(PreTran[[x]], PreTran[[x]][which(TranDt== max(TranDt)), ])))
#I need to add the current transaction's TranID to the list so I can merge later
BeforeTran <- lapply(seq_along(PreTran), function(x) ( transform(BeforeTran[[x]], TranID= TranList[x, 1])))
#Finally, I convert from a list to a data frame
BeforeTran <- do.call(rbind, BeforeTran)

#I have used a combination of data.table and for loops, but that seems cheesy and doesn't perform much better.
library(data.table)
#First I create a list of all the previous transactions by unit
TranList2 <- vector(nrow(Data), mode= "list")
names(TranList2) <- levels(Data$TranID)
DataDT <- data.table(Data)
#Use a for loop and data.table to find the date of the previous transaction
for (i in levels(Data$TranID)) {
  if (DataDT[UnitID== substr(i, 1, nchar(n)) & TranDt <= (DataDT[TranID== i, TranDt]), length(TranDt)] > 1)
    TranList2[[i]] <- cbind(TranID= i, DataDT[UnitID== substr(i, 1, nchar(n)) & TranDt < (DataDT[TranID== i, TranDt]), list(TranDt= max(TranDt))])
}
#Finally, I convert from a list to a data table
BeforeTran2 <- do.call(rbind, TranList2)
#My intuition says that this code doesn't take advantage of data.table's attributes.
#Are there any ideas out there? Thank you.
#P.S. I've tried plyr and it does not help my memory problem. -- William H. Rogers
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
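A runnable sketch of the rolling-join lookup on a few made-up rows :

library(data.table)
Data <- data.table(TranID = c("T1","T2","T3","T4"),
                   UnitID = c("U1","U1","U1","U2"),
                   TranDt = as.Date(c("2005-01-10","2007-06-01","2009-03-15","2006-02-20")))
setkey(Data, UnitID, TranDt)
previous <- transform(Data, TranDt = TranDt - 1)  # one day before each transaction
setkey(previous, UnitID, TranDt)                  # key i too, so the join matches on (UnitID, TranDt)
Data[previous, roll = TRUE]   # per row: the latest earlier transaction in the same UnitID, NA if none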
[R] [R-pkgs] data.table 1.4.1 now on CRAN
data.table is an enhanced data.frame with fast subset, fast grouping and fast merge. It uses a short and flexible syntax which extends existing R concepts. Example:

DT[a>3, sum(b*c), by=d]

where DT is a data.table with 4 columns (a,b,c,d). data.table 1.4.1 :
* grouping is now 10+ times faster than tapply()
* extract is 100+ times faster than ==, as before
* 3 new vignettes: Intro, FAQ and Timings
* NEWS file contains further details
http://datatable.r-forge.r-project.org/ http://cran.r-project.org/web/packages/data.table/index.html There is a new mailing list, datatable-help. Please do send comments, feedback, problems and questions. Matthew and Tom ___ R-packages mailing list r-packa...@r-project.org https://stat.ethz.ch/mailman/listinfo/r-packages
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
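A runnable illustration of the one-liner from the announcement (the toy data here is made up) :

library(data.table)
DT <- data.table(a = 1:6, b = rnorm(6), c = rnorm(6), d = c("x","y"))
DT[a > 3, sum(b * c), by = d]   # subset rows where a > 3, then sum(b*c) within each d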
Re: [R] Using plyr::dply more (memory) efficiently?
I don't know about that, but try this :

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)
summaries[,sum(counts),by=symbol]

Please let us know if that returns the correct result, and if its memory/speed is ok ? Matthew

Steve Lianoglou mailinglist.honey...@gmail.com wrote in message news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0a...@mail.gmail.com... Hi all, In short: I'm running ddply on an admittedly (somehow) large data.frame (not that large). It runs fine until it finishes and gets to the collating part where all subsets of my data.frame have been summarized and they are being reassembled into the final summary data.frame (sorry, don't know the correct plyr terminology). During collation, my R workspace RAM usage goes from about 1.5 GB up to 20GB until I kill it. Running a similar piece of code that iterates manually w/o ddply by using a combo of lapply and a do.call(rbind, ...) uses considerably less RAM (tops out at about 8GB). How can I use ddply more efficiently? Longer: Here's more info: * The data.frame itself is ~ 15.8 MB when loaded. * ~ 400,000 rows, 8 columns. It looks like so:

   exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
3        5659        152                38       utr      1 WASH5P     WASH5P chr1
4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
12       4693        140                36    intron      2 WASH5P     WASH5P chr1
13       4902        757                36    intron      1 WASH5P     WASH5P chr1
14       5811        659               144    intron     47 WASH5P     WASH5P chr1
15       6629         92                21    intron      1 WASH5P     WASH5P chr1
16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
17       7232        237                35    intron      2 WASH5P     WASH5P chr1
18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
...

I'm ply-ing over the transcript column and the function transforms each such subset of the data.frame into a new data.frame that is just 1 row / transcript that basically has the sum of the counts for each transcript. The code would look something like this (`summaries` is the data.frame I'm referring to):

rpkm <- ddply(summaries, .(transcript), function(df) {
  data.frame(symbol=df$symbol[1], counts=sum(df$counts))
})

(It actually calculates 2 more columns that are returned in the data.frame, but I'm not sure that's really important here). To test some things out, I've written another function to manually iterate/create subsets of my data.frame to summarize. I'm using sqldf to dump the data.frame into a db, then I lapply over subsets of the db `where transcript=x` to summarize each subset of my data into a list of single-row data.frames (like ddply is doing), and finish with a `do.call(rbind, the.dfs)` on this list. This returns the same exact result ddply would return, and by the time `do.call` finishes, my RAM usage hits about 8GB. So, what am I doing wrong with ddply that makes the difference in RAM usage in the last step (collation -- the equivalent of my final `do.call(rbind, my.dfs)`) be more than 12GB?
Thanks, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Using plyr::ddply more (memory) efficiently?
Steve Lianoglou mailinglist.honey...@gmail.com wrote in message news:t2ybbdc7ed01004290812n433515b5vb15b49c170f5a...@mail.gmail.com... Thanks for directing me to the data.table package. I read through some of the vignettes, and it looks quite nice. While your sample code would provide the answer if I wanted to just compute some summary statistic/function of groups of my data.frame (using `by=symbol`), what's the best way to produce several pieces of info per subset? For instance, I see that I can do something like this:

summaries[, list(counts=sum(counts), width=sum(exon.width)), by=symbol]

Yes, that's it. But what if I need to do some more complex processing within the subsets defined in `by=symbol` -- like several lines of programming logic for 1 result, say. I guess I can open a new block that just returns a data.table? Like:

summaries[, {
  cnts <- sum(counts)
  ew <- sum(exon.width)
  # ... some complex things
  complex <- # .. result of complex things
  data.table(counts=cnts, width=ew, cplx=complex)
}, by=symbol]

Is that right? (I mean, it looks like it's working, but maybe there's a more idiomatic way(?)) Yes, you got it. Rather than a data.table at the end though, just return a list, it's faster. Shorter vectors will still be recycled to match any longer ones. Or just this :

summaries[, list(
  counts = sum(counts),
  width = sum(exon.width),
  cplx = # .. result of complex things
), by=symbol]

Sounds like it's working, but could you give us an idea whether it is quick and memory efficient ? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
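A self-contained version of that grouped list() idiom (symbol, counts and exon.width are the thread's column names; the values and the density column are invented):

library(data.table)
summaries <- data.table(symbol     = rep(c("WASH5P","OR4F5"), each=3),
                        counts     = c(0,1,47,2,0,12),
                        exon.width = c(468,69,659,140,177,4429))
summaries[, {
  cnts <- sum(counts)
  ew   <- sum(exon.width)
  # several lines of logic per group; return a list at the end (faster than data.table)
  list(counts=cnts, width=ew, density=cnts/ew)
}, by=symbol]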
Re: [R] sum specific rows in a data frame
Or try data.table 1.4 on r-forge, its grouping is faster than aggregate :

         agg datatable
X10    0.012     0.008
X100   0.020     0.008
X1000  0.172     0.020
X1     1.164     0.144
X1e.05 9.397     1.180

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
dt <- as.data.table(df)
t3 <- system.time(zz3 <- dt[, list(sumflt=sum(fltval), sumint=sum(intval)), by=id])

Matthew On Thu, 15 Apr 2010 13:09:17 +, hadley wickham wrote: On Thu, Apr 15, 2010 at 1:16 AM, Chuck vijay.n...@gmail.com wrote: Depending on the size of the dataframe and the operations you are trying to perform, aggregate or ddply may be better. In the function below, df has the same structure as your dataframe. Current version of plyr:

         agg  ddply
X10    0.005  0.007
X100   0.007  0.026
X1000  0.086  0.248
X1     0.577  3.136
X1e.05 4.493 44.147

Development version of plyr:

         agg ddply
X10    0.003 0.005
X100   0.007 0.007
X1000  0.042 0.044
X1     0.410 0.443
X1e.05 4.479 4.237

So there are some big speed improvements in the works. Hadley __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
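A sketch of the kind of benchmark behind those timings, at a reduced size (df with id, fltval and intval columns as in the thread; the data itself is invented):

set.seed(1)
N  <- 1e5
df <- data.frame(id     = sample(1e3, N, replace=TRUE),
                 fltval = rnorm(N),
                 intval = sample(10L, N, replace=TRUE))
# base R aggregate
t1 <- system.time(zz1 <- aggregate(df[c("fltval","intval")], by=list(id=df$id), FUN=sum))
# data.table grouping
library(data.table)
dt <- as.data.table(df)
t2 <- system.time(zz2 <- dt[, list(sumflt=sum(fltval), sumint=sum(intval)), by=id])
rbind(aggregate=t1, data.table=t2)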
Re: [R] match function or ==
Please install v1.3 from R-forge : install.packages("data.table", repos="http://R-Forge.R-project.org") It will be ready for CRAN soon. Please follow up on datatable-h...@lists.r-forge.r-project.org Matthew bo bozha...@hotmail.com wrote in message news:1270689586866-1755876.p...@n4.nabble.com... Thank you very much for the help. I installed the data.table package, but I keep getting the following warnings: setkey(DT,id,date) Warning messages: 1: In `[.data.table`(deref(x), o) : This R session is < 2.4.0. Please upgrade to 2.4.0+. I'm using R 2.10, so why do I keep getting warnings about upgrading? Thanks again. -- View this message in context: http://n4.nabble.com/match-function-or-tp1754505p1755876.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Code is too slow: mean-centering variables in a dataframe by subgroup
Hi Dimitri, A start has been made at explaining .SD in FAQ 2.1. This was previously on a webpage, but it's just been moved to a vignette : https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/branch2/inst/doc/faq.pdf?rev=68root=datatable Please note: that vignette is part of a development branch on r-forge, and as such isn't even released to the r-forge repository yet. Please also see FAQ 4.5 in that vignette and follow up on datatable-h...@lists.r-forge.r-project.org An introduction vignette is taking shape too (again, in the development branch i.e. bleeding edge) : https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/branch2/inst/doc/intro.pdf?rev=68root=datatable HTH Matthew Dimitri Liakhovitski ld7...@gmail.com wrote in message news:r2rdae9a2a61004071314xc03ae851n4c9027b28df5a...@mail.gmail.com... Yes, Tom's solution is indeed the fastest! On my PC it took .17-.22 seconds while using ave() took .23-.27 seconds. And of course - the last two methods I mentioned took 1.3 SECONDS, not MINUTES (it was a typo). All that is left to me is to understand what .SD stands for. :-) Dimitri On Wed, Apr 7, 2010 at 4:04 PM, Rob Forler rfor...@uchicago.edu wrote: Leave it up to Tom to solve things wickedly fast :) Just as an fyi Dimitri, Tom is one of the developers of data.table. -Rob On Wed, Apr 7, 2010 at 2:51 PM, Dimitri Liakhovitski ld7...@gmail.com wrote: Wow, thank you, Tom! On Wed, Apr 7, 2010 at 3:46 PM, Tom Short tshort.rli...@gmail.com wrote: Here's how I would have done the data.table method. It's a bit faster than the ave approach on my machine:

# install.packages("data.table", repos="http://R-Forge.R-project.org")
library(data.table)
f3 <- function(frame) {
  frame <- as.data.table(frame)
  frame[, lapply(.SD[, 2:ncol(.SD), with = FALSE],
                 function(x) x / mean(x, na.rm = TRUE)),
        by = "group"]
}
system.time(new.frame2 <- f2(frame)) # ave
#  user  system elapsed
#  0.50    0.08    1.24
system.time(new.frame3 <- f3(frame)) # data.table
#  user  system elapsed
#  0.25    0.01    0.30

- Tom Tom Short On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski ld7...@gmail.com wrote: I would like to thank once more everyone who helped me with this question. I compared the speed for different approaches.
Below are the results of my comparisons - in case anyone is interested:

### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
N <- 10
set.seed(1234)
frame <- data.frame(group=rep(paste("group",1:10), N/10), a=rnorm(1:N), b=rnorm(1:N), c=rnorm(1:N), d=rnorm(1:N), e=rnorm(1:N), f=rnorm(1:N), g=rnorm(1:N))
frame <- frame[order(frame$group),]
## Introducing 60% NAs:
names.used <- names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
  i.for.NA <- sample(1:N, round((N*.6),0))
  frame[[i]][i.for.NA] <- NA
}
lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
ORIGframe <- frame ## placeholder for the unchanged original frame

### Objective of the code - divide each value by its group mean

### METHOD 1 - the FASTEST - using ave():
frame <- ORIGframe
f2 <- function(frame) {
  for(i in 2:ncol(frame)) {
    frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x) x/mean(x,na.rm=TRUE))
  }
  frame
}
system.time({new.frame <- f2(frame)}) # Took me 0.23-0.27 sec

### METHOD 2 - fast, just a bit slower - using data.table:
# If you don't have it - install the package - NOT from CRAN:
# install.packages("data.table", repos="http://R-Forge.R-project.org")
library(data.table)
frame <- ORIGframe
system.time({
  table <- data.table(frame)
  colMeanFunction <- function(data, key){
    data[[key]] <- NULL
    ret <- as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
    return(ret)
  }
  groupedMeans <- table[, colMeanFunction(.SD, "group"), by="group"]
  names.to.use <- names(groupedMeans)
  for(i in 1:length(groupedMeans)){ groupedMeans[[i]] <- as.data.frame(groupedMeans[[i]]) }
  groupedMeans <- do.call(cbind, groupedMeans)
  names(groupedMeans) <- names.to.use
}) # Took me 0.37-0.45 sec

### METHOD 3 - fast, a tad slower (using model.matrix matrix multiplication):
frame <- ORIGframe
system.time({
  mat <- as.matrix(frame[,-1])
  mm <- model.matrix(~0+group, frame)
  col.grp.N <- crossprod( !is.na(mat), mm )   # Use this line if you don't want to use NAs for mean calculations
  # col.grp.N <- crossprod( mat != 0 , mm )   # Use this line if you don't want to use zeros for mean calculations
  mat[is.na(mat)] <- 0.0
  col.grp.sum <- crossprod( mat, mm )
  mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group, ] )
  is.na(mat) <- is.na(frame[,-1])
  mat <- as.data.frame(mat)
}) # Took me 0.44-0.50 sec
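In current data.table, Method 2 collapses to the .SD idiom of Tom's f3; a self-contained sketch (the group, a and b columns and their values are invented):

library(data.table)
DT <- data.table(group=rep(c("g1","g2"), each=3),
                 a=c(1,2,3,10,20,30),
                 b=c(2,4,NA,5,NA,15))
# divide each value by its group mean; .SD holds every column except 'group'
DT[, lapply(.SD, function(x) x/mean(x, na.rm=TRUE)), by=group]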
Re: [R] memory error
someone else on this list may be able to give you a ballpark estimate of how much RAM this merge would require. I don't have an absolute estimate, but try data.table's merge, as it needs less working memory than base::merge. 20 million rows of 5 columns isn't beyond 32bit : (1*4 + 4*8)*19758564/1024^3 = 0.662GB. Also try sqldf to do the join. Matthew Sharpie ch...@sharpsteen.net wrote in message news:1270102758449-1747733.p...@n4.nabble.com... Janet Choate-2 wrote: Thanx for clarification on stating my problem, Charlie. I am attempting to merge two files, i.e.: hi39 = merge(comb[,c("hillID","geo")], hi.h39, by=c("hillID")) if this is relevant or helps to explain: the file 'comb' is 3 columns and 1127 rows, the file 'hi.h39' is 5 columns and 19758564 rows. i started a new clean R session in which i was able to read those 2 files in, but get the following error when i try to merge them: R(2175) malloc: *** mmap(size=79036416) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug R(2175) malloc: *** mmap(size=79036416) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug R(2175) malloc: *** mmap(size=158068736) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug R(2175) malloc: *** mmap(size=158068736) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug R(2175) malloc: *** mmap(size=158068736) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug Error: cannot allocate vector of size 150.7 Mb so the final error is "cannot allocate vector of size 150.7 Mb", as suggested when R runs out of memory. i am running R version 2.9.2, on mac os X 10.5 - leopard. any suggestion on how to increase R's memory on a mac? thanx for any much needed help! Janet Ah, so it is indeed a shortage of memory problem. With R 2.9.2, you are likely running a 32 bit version of R which will be limited to accessing at most 4 GB of RAM. You may want to try the newest version of R, 2.10.1, as it includes a 64 bit version that will allow you to access significantly more memory - provided you have the RAM installed on your system. I'm not too hot on memory usage calculation, but someone else on this list may be able to give you a ballpark estimate of how much RAM this merge would require. If it turns out to be a ridiculous amount, you will need to consider breaking the merge up into chunks or finding an out-of-core (i.e. not dependent on RAM for storage) merge tool. Hope this helps! -Charlie - Charlie Sharpsteen Undergraduate-- Environmental Resources Engineering Humboldt State University -- View this message in context: http://n4.nabble.com/memory-error-tp1747357p1747733.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
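A sketch of the keyed data.table join being suggested (object and column names follow Janet's post; the data here is invented and much smaller):

library(data.table)
comb   <- data.table(hillID=1:1127, geo=sample(letters, 1127, replace=TRUE), z=runif(1127))
hi.h39 <- data.table(hillID=sample(1127, 1e6, replace=TRUE),
                     v1=runif(1e6), v2=runif(1e6), v3=runif(1e6), v4=runif(1e6))
setkey(comb, hillID)
setkey(hi.h39, hillID)
# the keyed merge method needs less working memory than base::merge
hi39 <- merge(comb[, list(hillID, geo)], hi.h39, by="hillID")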
Re: [R] Adding RcppFrame to RcppResultSet causes segmentation fault
Rob, Please look again at Romain's reply to you on 19th March. He informed you then that Rcpp has its own dedicated mailing list and he gave you the link. Matthew R_help Help rhelp...@gmail.com wrote in message news:ad1ead5f1003291753p68d6ed52q572940f13e1c0...@mail.gmail.com... Hi, I'm a bit puzzled. I used exactly the same code as in the RcppExamples package to try adding an RcppFrame object to an RcppResultSet. When run, it gives me a segmentation fault. I'm using gcc 4.1.2 on redhat 64bit. I'm not sure if this is the cause of the problem. Any advice would be greatly appreciated. Thank you. Rob.

int numCol = 4;
std::vector<std::string> colNames(numCol);
colNames[0] = "alpha"; // column of strings
colNames[1] = "beta";  // column of reals
colNames[2] = "gamma"; // factor column
colNames[3] = "delta"; // column of Dates
RcppFrame frame(colNames);
// Third column will be a factor. In the current implementation the
// level names are copied to every factor value (and factors
// in the same column must have the same level names). The level names
// for a particular column will be factored out (pardon the pun) in
// a future release.
int numLevels = 2;
std::string *levelNames = new std::string[2];
levelNames[0] = std::string("pass"); // level 1
levelNames[1] = std::string("fail"); // level 2
// First row (this one determines column types).
std::vector<ColDatum> row1(numCol);
row1[0].setStringValue("a");
row1[1].setDoubleValue(3.14);
row1[2].setFactorValue(levelNames, numLevels, 1);
row1[3].setDateValue(RcppDate(7,4,2006));
frame.addRow(row1);
// Second row.
std::vector<ColDatum> row2(numCol);
row2[0].setStringValue("b");
row2[1].setDoubleValue(6.28);
row2[2].setFactorValue(levelNames, numLevels, 1);
row2[3].setDateValue(RcppDate(12,25,2006));
frame.addRow(row2);
RcppResultSet rs;
rs.add("PreDF", frame);

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Adding RcppFrame to RcppResultSet causes segmentation fault
He could have posted into this thread at the time to say that. Otherwise it appears that it's open. Romain Francois romain.franc...@dbmail.com wrote in message news:4bb4c4b8.2030...@dbmail.com... The thread has been handled in Rcpp-devel. Rob posted there 7 minutes after posting on r-help. FWIW, I think the problem is fixed in the Rcpp 0.7.11 version (on CRAN incoming) Romain -- Romain Francois Professional R Enthusiast +33(0) 6 28 91 30 30 http://romainfrancois.blog.free.fr |- http://tr.im/OIXN : raster images and RImageJ |- http://tr.im/OcQe : Rcpp 0.7.7 `- http://tr.im/O1wO : highlight 0.1-5 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] nlrq parameter bounds
Ashley, This appears to be your first post to this list. Welcome to R. Over 2 days is quite a long time to wait though, so you are unlikely to get a reply now. Feedback: since nlrq is in package quantreg, it's a question about a package and should be sent to the package maintainer. Some packages though, over 40 of the 664 on r-forge, have dedicated help/devel/forum lists hosted on r-forge. No reply from r-help often, but not always, means you haven't followed some detail of the posting guide or haven't followed this : http://www.catb.org/~esr/faqs/smart-questions.html. HTH Matthew Ashley Greenwood a.greenwo...@pgrad.unimelb.edu.au wrote in message news:45708.131.217.6.9.1269916052.squir...@webmail.student.unimelb.edu.au... Hi there, Can anyone please tell me if it is possible to limit parameters in nlrq() to 'upper' and 'lower' bounds as per nls()? If so, how? Many thanks in advance __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Error grid must have equal distances in each direction
M Joshi, I don't know, but I guess that some might have looked at your previous thread on 14 March (also about the geoR package). You received help and good advice then, but it doesn't appear that you are following it. It appears to be a similar problem this time. Also, this list is the wrong place for that question. Please read the posting guide to find out the correct place. It's a question about a package. HTH, Matthew maddy madhura1...@gmail.com wrote in message news:1269974076132-1745651.p...@n4.nabble.com... Hello All, Can anyone please help me with this error? Error in FUN(X[[1L]], ...) : different grid distances detected, but the grid must have equal distances in each direction -- try gridtriple=TRUE that avoids numerical errors. The program that I am trying to run is posted in the previous post of this thread. After row 1021 of my matrix of size 1024*1024, I start getting all the values as 0s. How do I set gridtriple, as I am using the grf function, which does not take this parameter as input? The maximum vector length that can be reached in 'R' is 2^30, so why does it not allow me to create arrays even of size 2^17? Thanks, M Joshi -- View this message in context: http://n4.nabble.com/Error-grid-must-have-equal-distances-in-each-direction-tp1695189p1745651.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Question about 'logit' and 'mlogit' in Zelig
Abraham, This appears to be your 3rd unanswered post to r-help in March, all 3 have been about the Zelig package. Please read the posting guide and find out the correct place to send questions about packages. Then you might get an answer. HTH Matthew Mathew, Abraham T amat...@ku.edu wrote in message news:281f7a5fdfef844696011cb21185f8ac0be...@mailbox-11.home.ku.edu... I'm running a multinomial logit in R using the Zelig package. According to str(trade962a), my dependent variable is a factor with three levels. When I run the multinomial logit I get an error message. However, when I run model="logit" it works fine. Any ideas on what's wrong?

## MULTINOMIAL LOGIT
anes96two <- zelig(trade962a ~ age962 + education962 + personal962 + economy962 + partisan962 + employment962 + union962 + home962 + market962 + race962 + income962, model="mlogit", data=data96)
summary(anes96two)
# Error in attr(tt, "depFactors")$depFactorVar :
#   $ operator is invalid for atomic vectors

## LOGIT
Call: zelig(formula = trade962a ~ age962 + education962 + personal962 + economy962 + partisan962 + employment962 + union962 + home962 + market962 + race962 + income962, model = "logit", data = data96)

Deviance Residuals:
   Min     1Q Median     3Q    Max
-2.021 -1.179  0.764  1.032  1.648

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.697675   0.600991  -1.161   0.2457
age962         0.003235   0.004126   0.784   0.4330
education962  -0.065198   0.038002  -1.716   0.0862 .
personal962    0.006827   0.072421   0.094   0.9249
economy962    -0.200535   0.084554  -2.372   0.0177 *
partisan962    0.092361   0.079005   1.169   0.2424
employment962 -0.009346   0.044106  -0.212   0.8322
union962      -0.016293   0.149887  -0.109   0.9134
home962       -0.150221   0.133685  -1.124   0.2611
market962      0.292320   0.128636   2.272   0.0231 *
race962        0.205828   0.094890   2.169   0.0301 *
income962      0.263363   0.048275   5.455 4.89e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1) Null deviance: 1841.2 on 1348 degrees of freedom Residual deviance: 1746.3 on 1337 degrees of freedom (365 observations deleted due to missingness) AIC: 1770.3 Number of Fisher Scoring iterations: 4 Thanks Abraham __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] zero standard errors with geeglm in geepack
You may not have got an answer because you posted to the wrong place. It's a question about a package. Please read the posting guide. miriza miri...@sfwmd.gov wrote in message news:1269886286228-1695430.p...@n4.nabble.com... Hi! I am using geeglm to fit a Poisson model to a timeseries of count data as follows. Since there are no clusters I use 73 values of 1 for the ids. The problem I have is that I am getting standard errors of zero for the parameters. What am I doing wrong? Thanks, Michelle N_Base [1] 95 85 104 88 102 104 91 88 85 115 96 83 91 107 96 116 118 103 89 88 101 117 82 80 83 103 115 119 95 90 82 91 108 115 93 96 72 [38] 98 95 98 97 104 86 107 92 94 95 100 107 76 104 101 80 102 100 91 96 89 71 109 97 113 99 127 115 91 81 73 69 92 90 78 57 Year [1] 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 [31] 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 [61] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2006

tes <- geese(formula = N_Base ~ Year, id = rep(1, 73), family = poisson, corstr = "ar1")
summary(tes)
Call: geese(formula = N_Base ~ Year, id = rep(1, 73), family = poisson, corstr = "ar1")

Mean Model: Mean Link: log Variance to Mean Relation: poisson
Coefficients:
            estimate san.se wald p
(Intercept)   7.1131      0  Inf 0
Year         -0.0013      0  Inf 0

Scale Model: Scale Link: identity
Estimated Scale Parameters:
            estimate san.se wald p
(Intercept)     1.79      0  Inf 0

Correlation Model: Correlation Structure: ar1 Correlation Link: identity
Estimated Correlation Parameters:
      estimate san.se wald p
alpha    0.187      0  Inf 0

Returned Error Value: 0 Number of clusters: 1 Maximum cluster size: 73 -- View this message in context: http://n4.nabble.com/zero-standard-errors-with-geeglm-in-geepack-tp1695430p1695430.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] GEE for a timeseries of count (one cluster)
Contact the authors of those packages ? miriza miri...@sfwmd.gov wrote in message news:1269981675252-1745896.p...@n4.nabble.com... Hi! I was wondering if there were any packages that would allow me to fit a GEE to a single timeseries of counts so that I could account for autocorrelation in the data. I tried gee, geepack and yags packages, but I do not get standard errors for the parameters when using a single cluster. Any tips? Thanks, Michelle -- View this message in context: http://n4.nabble.com/GEE-for-a-timeseries-of-count-one-cluster-tp1745896p1745896.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mcmcglmm starting value example
Apparently not, since this is your 3rd unanswered thread to r-help this month about this package. Please read the posting guide and find out where you should send questions about packages. Then you might get an answer. ping chen chen1984...@yahoo.com.cn wrote in message news:975148.47160...@web15304.mail.cnb.yahoo.com... Hi R-users: Can anyone give an example of giving starting values for MCMCglmm? I can't find any anywhere. I have 1 random effect (physicians, and there are 50 of them) and family="ordinal". How can I specify starting values for my fixed effects? It doesn't seem to have the option to do so. Thanks, Ping __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] GLM / large dataset question
Geelman, This appears to be your first post to this list. Welcome to R. Nearly 2 days is quite a long time to wait though, so you are unlikely to get a reply now. Feedback : the question seems quite vague and imprecise. It depends on which R you mean (32bit/64bit) and how much RAM you have. It also depends on your data and what you want to do with it. Did you mean 100.000 (i.e. one hundred) or 100,000? Also, '8000 explanatory variables' seems a lot, especially to be stored in 'a factor'. There is no R code in your post so we can't tell if you're using glm correctly or not. You could provide the result of object.size() and dim() on your data rather than explaining it in words. No reply often, but not always, means you haven't followed some detail of the posting guide or haven't followed this : http://www.catb.org/~esr/faqs/smart-questions.html. HTH Matthew geelman geel...@zonnet.nl wrote in message news:mkedkcmimcmgohidffmbieklcaaa.geel...@zonnet.nl... LS, How large a dataset can glm fit with a binomial link function? I have a set of about 100.000 observations and about 8000 explanatory variables (a factor with 8000 levels). Is there a way to find out how large datasets R can handle in general? Thanks in advance, geelman __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Combing
Val, Type "combine two data sets" (text you wrote in your post) into www.rseek.org. The first two links are: Quick-R: Merge and Merging data: A tutorial. Isn't it quicker for you to use rseek than to write a post and wait for a reply ? Don't you also get more detailed information that way too ? You already received advice from others on this list to look at www.rseek.org on 26 Oct, package 'sos' on 27 Oct, and to 'read the manuals and FAQs before posting' on 5 Nov. This month you have posted 3 times : Loop, Renumbering and Combing. References : 1. Posting Guide headings : Do your homework before posting and Further resources. 2. Contributed Documentation e.g. 'R Reference Card' by Tom Short http://cran.r-project.org/doc/contrib/Short-refcard.pdf. 3. Eric Raymond's essay http://www.catb.org/~esr/faqs/smart-questions.html. e.g. you have posted to r-help 10 times so far, and 9 of the 10 subjects were either a single word or a single function name. HTH Matthew Val valkr...@gmail.com wrote in message news:cdc083ac1003290413s7e047e25lc4202568af119...@mail.gmail.com... Hi all, I want to combine two data sets (ZA and ZB to get ZAB). The common variable between the two data sets is ID.

Data ZA
ID F M
1  0 0
2  0 0
3  1 2
4  1 0
5  3 2
6  5 4

Data ZB
ID  v1  v2  v3
3  2.5 3.4 302
4  8.6 2.9 317
5  9.7 4.0 325
6  7.5 1.9 296

Output (ZAB)
ID F M  v1  v2  v3
1  0 0  -9  -9  -9
2  0 0  -9  -9  -9
3  1 2 2.5 3.4 302
4  1 0 8.6 2.9 317
5  3 2 9.7 4.0 325
6  5 4 7.5 1.9 296

Any help is highly appreciated in advance, Val [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
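For reference, the merge being asked for can be sketched in base R like this (data typed in from Val's post; -9 fills the missing cells as in the desired output):

ZA  <- data.frame(ID=1:6, F=c(0,0,1,1,3,5), M=c(0,0,2,0,2,4))
ZB  <- data.frame(ID=3:6, v1=c(2.5,8.6,9.7,7.5), v2=c(3.4,2.9,4.0,1.9), v3=c(302,317,325,296))
ZAB <- merge(ZA, ZB, by="ID", all.x=TRUE)  # keep all rows of ZA
ZAB[is.na(ZAB)] <- -9
ZAB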
Re: [R] NA values in indexing
The type of 'NA' is logical. So x[NA] behaves more like x[TRUE] i.e. silent recycling.

class(NA)
[1] "logical"
x = 101:108
x[NA]
[1] NA NA NA NA NA NA NA NA
x[c(TRUE,NA)]
[1] 101  NA 103  NA 105  NA 107  NA
x[as.integer(NA)]
[1] NA

HTH Matthew Barry Rowlingson b.rowling...@lancaster.ac.uk wrote in message news:d8ad40b51003260509y6b671e53o9f79142d2b52c...@mail.gmail.com... If you index a vector with a vector that has NA in it, you get NA back:

x = 101:107
x[c(NA,4,NA)]
[1]  NA 104  NA
x[c(4,NA)]
[1] 104  NA

All well and good. ?"[" says, under NAs in indexing: When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA in the corresponding element of a logical, integer, numeric, complex or character result, and NULL for a list. (It returns "00" for a raw result.) But if the indexing vector is all NA, you get back a vector of the length of your original vector rather than of your index vector:

x[c(NA,NA)]
[1] NA NA NA NA NA NA NA

Maybe it's just me, but I find this surprising, and I can't see it documented. Bug or undocumented feature? Apologies if I've missed something obvious. Barry sessionInfo() R version 2.11.0 alpha (2010-03-25 r51407) i686-pc-linux-gnu locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] translating SQL statements into data.table operations
Nick, Good question, but just sent to the wrong place. The posting guide asks you to contact the package maintainer first, and to post to r-help only if you don't hear back. I guess one reason for that is that if questions about all 2000+ packages were sent to r-help, then r-help's traffic could go through the roof. Another reason could be that some (i.e. maybe many, maybe few) package maintainers don't actually monitor r-help and might miss any messages you post here. I only saw this one thanks to Google Alerts. Since I'm writing anyway ... are you using the latest version on r-forge which has the very fast grouping? Have you set multi-column keys on both edt and cdt and tried the edt[cdt,roll=TRUE] syntax ? We'll help you off list to climb the learning curve quickly. We are working on FAQs and a vignette and they should be ready soon too. Please do follow up with us (myself and Tom Short, cc'd, are the main developers) off list and one of us will be happy to help further. Matthew Nick Switanek nswita...@gmail.com wrote in message news:772ec1011003241351v6a3f36efqb0b0787564691...@mail.gmail.com... I've recently stumbled across data.table, Matthew Dowle's package. I'm impressed by the speed of the package in handling operations with large data.frames, but am a bit overwhelmed by the syntax. I'd like to express the SQL statement below using data.table operations rather than sqldf (which was incredibly slow for a small subset of my financial data) or import/export with a DBMS, but I haven't been able to figure out how to do it. I would be grateful for your suggestions. nick My aim is to join events (trades) from two datasets (edt and cdt) where, for the same stock, the events in one dataset occur between 15 and 75 days before the other, and within the same time window. I can only see how to express the WHERE e.SYMBOL = c.SYMBOL part in data.table syntax. I'm also at a loss as to whether I can express the remainder using data.table's %between% operator or not.

ctqm <- sqldf("SELECT e.*, c.DATE 'DATEctrl', c.TIME 'TIMEctrl', c.PRICE 'PRICEctrl', c.SIZE 'SIZEctrl'
  FROM edt e, ctq c
  WHERE e.SYMBOL = c.SYMBOL
    AND julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND 75
    AND strftime('%H:%M:%S',c.TIME) BETWEEN strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)")

[[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
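A sketch of the keyed rolling join mentioned above (all table contents are invented; the 15-75 day window is not shown, only the roll mechanics; roll=TRUE matches each edt row to the most recent cdt row at or before it, within SYMBOL):

library(data.table)
cdt <- data.table(SYMBOL=c("A","A","B"), TIME=c(1L,5L,3L), PRICE=c(10.0,10.5,20.1))
edt <- data.table(SYMBOL=c("A","B"), TIME=c(4L,7L), SIZE=c(100,250))
setkey(cdt, SYMBOL, TIME)
setkey(edt, SYMBOL, TIME)
# exact match on SYMBOL, then roll the prevailing TIME forward
cdt[edt, roll=TRUE]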
Re: [R] Mosaic
When you click search on the R homepage, type mosaic into the box, and click the button, do the top 3 links seem relevant ? Your previous 2 requests for help : 26 Feb : Response was SuppDists. Yet that is the first hit returned by the subject line you posted : Hartleys table. 22 Feb : Response was shapiro.test. Yet that is in the second hit returned by the subject line you posted : normality in split plot design. Spot the pattern ? Silvano silv...@uel.br wrote in message news:a9322645c4f846a3a6a9daaa8b5a2...@ccepc... Hi, I have this data set:

obitoss = c(5.8,17.4,5.9,17.6,5.8,17.5,4.7,15.8,
            3.8,13.4,3.8,13.5,3.7,13.4,3.4,13.6,
            4.4,17.3,4.3,17.4,4.2,17.5,4.3,17.0,
            4.4,13.6,5.1,14.6,5.7,13.5,3.6,13.3,
            6.5,19.6,6.4,19.4,6.3,19.5,6.0,19.7)
(dados = data.frame(
  regiao = factor(rep(c('Norte', 'Nordeste', 'Sudeste', 'Sul', 'Centro-Oeste'), each=8)),
  ano    = factor(rep(c('2000','2001','2002','2003'), each=2)),
  sexo   = factor(rep(c('F','M'), 4)),
  resp   = obitoss))

I would like to make a mosaic to represent the numeric variable depending on 3 variables. Does anyone know how to do it? -- Silvano Cesar da Costa Departamento de Estatística Universidade Estadual de Londrina Fone: 3371-4346 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
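One way to draw what is being asked for, as a sketch in base R (dados as defined in the quoted post; summing resp over the three factors is an assumption about the aggregation wanted):

tab <- xtabs(resp ~ regiao + ano + sexo, data=dados)
mosaicplot(tab, main="resp by regiao, ano and sexo", color=TRUE)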
Re: [R] If else statements
Here are some references. Please read these first and post again if you are still stuck after reading them. If you do post again, we will need x and y. 1. Introduction to R : 9.2.1 Conditional execution: if statements. 2. R Language Definition : 3.2 Control structures. 3. R for beginners by E Paradis : 6.1 Loops and vectorization. 4. Eric Raymond's essay How to Ask Questions The Smart Way : http://www.catb.org/~esr/faqs/smart-questions.html. HTH Matthew tj girlm...@yahoo.com wrote in message news:1269325933723-1678705.p...@n4.nabble.com... Hi everyone! May I request your help again? I need to make some code using if-else statements... Can I do an if-else statement inside an if-else statement? Is this the correct form of writing it? Thank you.=) Example:

for (v in 1:6) {
  for (i in 2:200) {
    if (v==1) (if max(x*v-y*v)>1 break())
    if (v==2) (if max(x*v-y*v)>1.8 break())
    if (v==3) (if max(x*v-y*v)>2 break())
  }
}

-- View this message in context: http://n4.nabble.com/If-else-statements-tp1678705p1678705.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
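For reference, a syntactically valid version of what appears to be attempted (x, y and most of the thresholds are invented; in R the if condition must be in parentheses, and break takes no arguments):

x <- rnorm(200); y <- rnorm(200)        # invented data
thr <- c(1, 1.8, 2, 2.2, 2.5, 3)        # one threshold per v; first three from the post
for (v in 1:6) {
  for (i in 2:200) {
    if (max(x*v - y*v) > thr[v]) break  # note: as in the post, the condition does not use i
  }
}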
Re: [R] Forecasting with Panel Data
Ricardo, I see you got no public answer so far, on either of the two lists you posted to at the same time yesterday. You are therefore unlikely to ever get a reply. I also see you've been having trouble getting answers in the past, back to Nov 09, at least. For example no reply to "Credit Migration Matrix" (Jan 2010) and no reply to "Help with a Loop in function" (Nov 2009). For your information, this is a public place and it took me about 10 seconds to assess you. Anyone else on the planet can do this too. Please read the posting guide AND the links from it, especially the last link. I suggest you read it fully, and slowly. I think it's just that you didn't know about it, or somehow missed it by accident. You were told to read it though, at the time you subscribed to this list, at least. Don't worry, this is not a huge problem. You can build up your reputation again very quickly. With the kindest of regards, Matthew Ricardo Gonçalves Silva ricard...@terra.com.br wrote in message news:df406bd9dbe644a9b8c0642a3c3f8...@ricardopc... Dear Users, Can I perform panel data (fixed effects model) out of sample forecasts using R? Thanks in advance, Ricardo. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] speed
Your choice of subject line alone shows some people that you missed some small details from the posting guide. The ability to notice small details may be important for you to demonstrate in future. Any answer in this thread is unlikely to be found by a topic search on subject lines alone since speed is a single word. One fast way to increase your reputation is to contribute. You now have an opportunity. If you follow Jim's good advice, discover the answer for yourself, and post it back to the group, changing the subject line so that it's easier for others to find in future, that's one way you can contribute and increase your reputation. If you don't do that, that's your choice. It is entirely up to you. Whatever action you take next, even doing nothing is an action, it is visible in public for everyone to search back and find out within seconds. HTH Adam Majewski adamm...@o2.pl wrote in message news:hn6fp4$2g...@dough.gmane.org... Hi, I have found some example of R code : http://commons.wikimedia.org/wiki/File:Mandelbrot_Creation_Animation_%28800x600%29.gif When I run this code on my computer it takes a few seconds. I wanted to make a similar program in Maxima CAS : http://thread.gmane.org/gmane.comp.mathematics.maxima.general/29949/focus=29968 for example :

f(x,y,n) := block([i:0, c:x+y*%i, ER:4, iMax:n, z:0],
  while abs(z)<ER and i<iMax do (z:z*z + c, i:i+1),
  min(ER,abs(z)))$
wxanimate_draw3d(n, 5, enhanced3d=true,
  user_preamble="set pm3d at b; set view map",
  xu_grid=70, yv_grid=70,
  explicit('f(x,y,n), x, -2, 0.7, y, -1.2, 1.2))$

But it takes so long to make even one image (hours). What makes the difference, and why is R so fast ? Regards Adam __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
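The short technical answer is likely vectorisation: the linked R script iterates z <- z^2 + c over the whole grid of points at once with matrix operations, whereas the Maxima function above is evaluated once per point. A minimal sketch of the vectorised idea (grid size and iteration count are arbitrary; this is not the wikimedia script itself):

n  <- 50
xs <- seq(-2, 0.7, length.out=300)
ys <- seq(-1.2, 1.2, length.out=300)
cgrid <- outer(xs, ys*1i, "+")             # the complex plane as one matrix
z     <- matrix(0+0i, nrow(cgrid), ncol(cgrid))
count <- matrix(0L,   nrow(cgrid), ncol(cgrid))
for (k in 1:n) {
  alive <- Mod(z) < 2                      # points that have not yet escaped
  z[alive] <- z[alive]^2 + cgrid[alive]
  count[alive] <- count[alive] + 1L
}
image(xs, ys, count, col=grey.colors(n))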
Re: [R] Strange result in survey package: svyvar
This list is the wrong place for that question. The posting guide tells you, in bold, to contact the package maintainer first. If you had already done that, and didn't hear back from him, then you should tell us, so that we know you followed the guide. Corey Sparks corey.spa...@utsa.edu wrote in message news:c7bd3ca5.206a%corey.spa...@utsa.edu... Hi R users, I'm using the survey package to calculate summary statistics for a large health survey (the Demographic and Health Survey for Honduras, 2006), and when I try to calculate the variances for several variables, I get negative numbers. I thought it may be my data, so I ran the example on the help page:

data(api)
## one-stage cluster sample
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svyvar(~api00+enroll+api.stu+api99, dclus1)
        variance     SE
api00    11182.8 1386.4
api00    11516.3 1412.9
api.stu  -4547.1 3164.9
api99    12735.2 1450.1

If I look at the full matrix for the variances (and covariances):

test <- svyvar(~api00+enroll+api.stu+api99, dclus1)
print(test, covariance=T)
                variance      SE
api00:api00      11182.8  1386.4
enroll:api00     -5492.4  3458.1
api.stu:api00    -4547.1  3164.9
api99:api00      11516.3  1412.9
api00:enroll     -5492.4  3458.1
enroll:enroll   136424.3 41377.2
api.stu:enroll  114035.7 34153.9
api99:enroll     -3922.3  3589.9
api00:api.stu    -4547.1  3164.9
enroll:api.stu  114035.7 34153.9
api.stu:api.stu  96218.9 28413.7
api99:api.stu    -3060.0  3260.9
api00:api99      11516.3  1412.9
enroll:api99     -3922.3  3589.9
api.stu:api99    -3060.0  3260.9
api99:api99      12735.2  1450.1

I see that the function is actually returning the covariance of api.stu with the api00 variable. I can get the correct variances if I just take diag(test). But I was just wondering if anyone else was having this problem. I'm using : sessionInfo() R version 2.10.1 Patched (2009-12-20 r50794) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] survey_3.19 loaded via a namespace (and not attached): [1] tools_2.10.1 And have the same error on a linux server. Thanks, Corey -- Corey Sparks Assistant Professor Department of Demography and Organization Studies University of Texas at San Antonio 501 West Durango Blvd Monterey Building 2.270C San Antonio, TX 78207 210-458-3166 corey.sparks 'at' utsa.edu https://rowdyspace.utsa.edu/users/ozd504/www/index.htm __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] IMPORTANT - To remove the null elements from a vector
Welcome to R Barbara. Its quite an incredible community from all walks of life. Your beginner questions are answered in the manual. See Introduction to R. Please read the posting guide again because it contains lots of good advice for you. Some people read it three times before posting because they have so much respect for the community. Sometimes they trip up over themselves to show they have read it. Btw - just to let you know that starting your subject lines with IMPORTANT is considered by some people a demanding tone for free help. Not everyone, but some people. Two posts starting IMPORTANT within 5 minutes is another thing that a very large number of people around the world may have just seen you do. I'm just letting you know, in case you were not aware of this. You received answers from four people who clearly don't mind, and you have your answers. Was that your only goal in posting? Did you consider there might be downsides? This is a public list read by many people and one thing the posting guide says is that your questions are saved in the archives forever. Just checking you knew that. I wouldn't want you to reduce your reputation accidentally. A future employer (it might be a company, or it might be a university) anywhere in the world might do a simple search on your name, and that's why you might not get an interview, because you had shown (in their minds) that you didn't have respect for guidelines. I would hate for something like that to happen, all just because you didn't know you were supposed to read the posting guide; it wouldn't be fair on you. So it would be very unfair of me to know that, and suspect that you don't, but not tell you about the posting guide, wouldn't it ? I hope this information helps you. It is entirely up to you. r-help is a great way to increase your reputation, but it can reduce your reputation too. By asking great questions, or even contributing, you can proudly put that on your CV and increase your chances of getting that interview, or getting that position. I have seen on several CVs from students the text "please search for my name on r-help". Just like everything you do in public, r-help is very similar. What you write, you write in the public domain, and you write it free of charge, and free of restriction. All this applies to all of us, when asking for help and when giving help. Matthew barbara.r...@uniroma1.it wrote in message news:of1a8063a1.fc14f5ff-onc12576e1.00466053-c12576e1.00466...@uniroma1.it... I have a vector that has null elements. How do I remove these elements? For example: x=[10 0 30 40 0 0] I want the vector y=[10 30 40] Thanks [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
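For completeness, the answer the four repliers gave was presumably along these lines (a one-line sketch in R):

x <- c(10, 0, 30, 40, 0, 0)
y <- x[x != 0]   # keep only the non-zero elements
y                # 10 30 40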
Re: [R] fit a gamma pdf using Residual Sum-of-Squares
Thanks for making it quickly reproducible - I was able to see that message in English within a few seconds. The start has x=86, but the data is also called x. Remove x=86 from start and you get a different error. P.S. - please do include the R version information. It saves time for us, and we like it if you save us time. vincent laperriere vincent_laperri...@yahoo.fr wrote in message news:883644.16455...@web24106.mail.ird.yahoo.com... Hi all, I would like to fit a gamma pdf to my data using the method of RSS (Residual Sum-of-Squares). Here are the data:

x <- c(86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138, 142, 146, 150, 154, 158, 162, 166, 170, 174)
y <- c(2, 5, 10, 17, 26, 60, 94, 128, 137, 128, 77, 68, 65, 60, 51, 26, 17, 9, 5, 2, 3, 7, 3)

I have typed the following code, using the nls method:

fit <- nls(y ~ (1/((s^a)*gamma(a))*x^(a-1)*exp(-x/s)), start = c(s=3, a=75, x=86))

But I get the following error message (sorry, this is in German): Fehler in qr(.swts * attr(rhs, gradient)) : Dimensionen [Produkt 3] passen nicht zur Länge des Objektes [23] Zusätzlich: Warnmeldung: In .swts * attr(rhs, gradient) : Länge des längeren Objektes ist kein Vielfaches der Länge des kürzeren Objektes [in English: Error in qr(.swts * attr(rhs, gradient)) : dimensions [product 3] do not match the length of the object [23]. In addition: Warning message: the length of the longer object is not a multiple of the length of the shorter object.] Could anyone help me with the code? I would greatly appreciate it. Sincerely yours, Vincent Laperrière. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] ifthen() question
This post breaks the posting guide in multiple ways. Please read it again (and then again) - in particular the first 3 paragraphs. You will help yourself by following it. The solution is right there in the help page for ?data.frame and other places including Introduction to R. I think it's more helpful *not* to tell you what it is, so that you discover it for yourself, learn how to learn, and google. I hope you appreciate that I've been helpful by simply (and quickly) telling you the answer *is* there. Having said that, you don't appear to be aware of many of the packages around that do this task - you appear to be re-inventing the wheel. I suggest you briefly investigate each and every one of the top 30 packages ranked by crantastic before writing any more R code. A little time invested doing that will pay you dividends in the long run. That is not a complaint of you though, as that advice is not in the posting guide. Matthew AC Del Re de...@wisc.edu wrote in message news:85cf8f8d1003040735k2b076142jc99b7ec34da87...@mail.gmail.com... Hi All, I am using a specialized aggregation function to reduce a dataset with multiple rows per id down to 1 row per id. My function works perfectly when there are > 1 id but alters the 'var.g' in undesirable ways when this condition is not met. Therefore, I have been trying ifthen() statements to keep the original value when the length of unique id == 1, but I cannot get it to work. e.g.:

# function to aggregate effect sizes:
aggs <- function(g, n.1, n.2, cor = .50) {
  n.1 <- mean(n.1)
  n.2 <- mean(n.2)
  N_ES <- length(g)
  corr.mat <- matrix(rep(cor, N_ES^2), nrow=N_ES)
  diag(corr.mat) <- 1
  g1g2 <- cbind(g) %*% g
  PSI <- (8*corr.mat + g1g2*corr.mat^2)/(2*(n.1+n.2))
  PSI.inv <- solve(PSI)
  a <- rowSums(PSI.inv)/sum(PSI.inv)
  var.g <- 1/sum(PSI.inv)
  g <- sum(g*a)
  out <- cbind(g, var.g, n.1, n.2)
  return(out)
}

# automating this procedure for all rows of df.
This format works perfectly when there is 1 id per row only:

agg_g <- function(id, g, n.1, n.2, cor = .50) {
  st <- unique(id)
  out <- data.frame(id=rep(NA,length(st)))
  for(i in 1:length(st)) {
    out$id[i] <- st[i]
    out$g[i] <- aggs(g=g[id==st[i]], n.1=n.1[id==st[i]], n.2=n.2[id==st[i]], cor)[1]
    out$var.g[i] <- aggs(g=g[id==st[i]], n.1=n.1[id==st[i]], n.2=n.2[id==st[i]], cor)[2]
    out$n.1[i] <- round(mean(n.1[id==st[i]]),0)
    out$n.2[i] <- round(mean(n.2[id==st[i]]),0)
  }
  return(out)
}

# The attempted solution using ifelse() and minor changes to the function, but it's not working properly:

agg_g <- function(df, var.g, id, g, n.1, n.2, cor = .50) {
  df$var.g <- var.g
  st <- unique(id)
  out <- data.frame(id=rep(NA,length(st)))
  for(i in 1:length(st)) {
    out$id[i] <- st[i]
    out$g[i] <- aggs(g=g[id==st[i]], n.1=n.1[id==st[i]], n.2=n.2[id==st[i]], cor)[1]
    out$var.g[i] <- ifelse(length(st[i])==1,
                           df$var.g[id==st[i]],
                           aggs(g=g[id==st[i]], n.1=n.1[id==st[i]], n.2=n.2[id==st[i]], cor)[2])
    out$n.1[i] <- round(mean(n.1[id==st[i]]),0)
    out$n.2[i] <- round(mean(n.2[id==st[i]]),0)
  }
  return(out)
}

# sample data:
id <- c(1, rep(1:19))
n.1 <- c(10,20,13,22,28,12,12,36,19,12,36,75,33,121,37,14,40,16,14,20)
n.2 <- c(11,22,10,20,25,12,12,36,19,11,34,75,33,120,37,14,40,16,10,21)
g <- c(.68,.56,.23,.64,.49,-.04,1.49,1.33,.58,1.18,-.11,1.27,.26,.40,.49,.51,.40,.34,.42,1.16)
var.g <- c(.08,.06,.03,.04,.09,.04,.009,.033,.0058,.018,.011,.027,.026,.0040,.049,.0051,.040,.034,.0042,.016)
df <- data.frame(id, n.1, n.2, g, var.g)

Any help is much appreciated, AC [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Nonparametric generalization of ANOVA
Frank, I respect your views but I agree with Gabor. The posting guide does not support your views. It is not any of our views that are important but we are following the posting guide. It covers affiliation. It says only that some consider it good manners to include a concise signature specifying affiliation. It does not agree that it is bad manners not to. It is therefore going too far to urge R-gurus, whoever they might be, to ignore such postings on that basis alone. It is up to responders (I think that is the better word, which is the one used by the posting guide) whether they reply. Missing affiliation is ok by the posting guide. Users shouldn't be put off from posting because of that alone. Sending from an anonymous email address such as BioStudent is also fine by the posting guide as far as my eyes read it. It says only that the email address should work. I would also answer such anonymous posts, providing they demonstrate they made best efforts to follow the posting guide, as usual for all requests for help. It's so easy to send from a false but apparently real name, so why worry about that? If you disagree with the posting guide then you could make a suggestion to get the posting guide changed with respect to these points. But, currently, good practice is defined by the posting guide, and I can't see that your view is backed up by it. In fact it seems to me that these points were carefully considered, and the wording is careful on these points. As far as I know you are wrong that there is no moderator. There are in fact an uncountable number of people who are empowered to moderate i.e. all of us. In other words it's up to the responders to moderate. The posting guide is our guide. As a last resort we can alert the list administrator (which I believe is the correct name for him in that role), who has powers to remove an email address from the list if he thinks that is appropriate, or act otherwise, or not at all. It is actually up to responders (i.e. all of us) to ensure the posting guide is followed. My view is that the problems started with some responders on some occasions. They sometimes forgot, a little bit, to encourage and remind posters to follow the posting guide when it was not followed. This then may have encouraged more posters to think it was ok not to follow the posting guide. That is my own personal view, not a statistical one backed up by any evidence. Matthew Frank E Harrell Jr f.harr...@vanderbilt.edu wrote in message news:4b913880.9020...@vanderbilt.edu... Gabor Grothendieck wrote: I am happy to answer posts to r-help regardless of the name and email address of the poster but would draw the line at someone excessively posting without a reasonable effort to find the answer first or using it for homework since such requests could flood the list making it useless for everyone. Gabor I respectfully disagree. It is bad practice to allow anonymous postings. We need to see real names and real affiliations. r-help is starting to border on uselessness because of the age-old problem of the same question being asked every two days, a high frequency of specialty questions, and answers given with the best of intentions in incremental or contradictory e-mail pieces (as opposed to a cumulative wiki or hierarchically designed discussion web forum), as there is no moderator for the list. We don't need even more traffic from anonymous postings. Frank On Fri, Mar 5, 2010 at 10:55 AM, Ravi Varadhan rvarad...@jhmi.edu wrote: David, I agree with your sentiments.
I also think that it is bad posting etiquette not to sign one's genuine name and affiliation when asking for help, which blue sky seems to do a lot. Bert Gunter has already raised this issue, and I completely agree with him. I would also like to urge the R-gurus to ignore such postings. Best, Ravi. Ravi Varadhan, Ph.D. Assistant Professor, Division of Geriatric Medicine and Gerontology School of Medicine Johns Hopkins University Ph. (410) 502-2619 email: rvarad...@jhmi.edu - Original Message - From: David Winsemius dwinsem...@comcast.net Date: Friday, March 5, 2010 9:25 am Subject: Re: [R] Nonparametric generalization of ANOVA To: blue sky bluesky...@gmail.com Cc: r-h...@stat.math.ethz.ch On Mar 5, 2010, at 8:19 AM, blue sky wrote: My interpretation of the relation between 1-way ANOVA and Wilcoxon's test (wilcox.test() in R) is the following. 1-way ANOVA is to test if two or multiple distributions are the same, assuming all the distributions are normal and have equal variances. Wilcoxon's test is to test two distributions are the same without assuming what their distributions are. In this sense, I'm wondering what is the generalization of Wilcoxon's test to more than two distributions. And, more general, what
Re: [R] Nonparametric generalization of ANOVA
John, So you want BlueSky to change their name to Paul Smith at New York University, to give a totally random example of a false name, and then you will be happy ? I just picked a popular, real name at a real, big place. Are you, or is anyone else, going to check it's real ? We want BlueSky to ask great questions, which haven't been asked before, and to follow the posting guide. If BlueSky improves the knowledge base, what's the problem? This person may well be breaking the posting guide for many other reasons (I haven't looked), and if they are then you could take issue with them on those points, but not for simply writing as BlueSky. David W has got it right when he replied to ManInMoon. Shall we stop this thread now, and follow his lead ? I would have picked ManOnMoon myself but maybe that one was taken. It's rather difficult to be on a moon, let alone inside it. Matthew John Sorkin jsor...@grecc.umaryland.edu wrote in message news:4b91068702cb00064...@medicine.umaryland.edu... The sad part of this interchange is that Blue Sky does not seem to be amenable to suggestion. He, or she, has not taken note of, or responded to, the fact that a number of people believe it is good manners to give a real name and affiliation. My mother taught me that when two people tell you that you are drunk you should lie down until the inebriation goes away. Blue Sky, several people have noted that you would do well to give us your name and affiliation. Is this too much to ask given that people are good enough to help you? John John David Sorkin M.D., Ph.D. Chief, Biostatistics and Informatics University of Maryland School of Medicine Division of Gerontology Baltimore VA Medical Center 10 North Greene Street GRECC (BT/18/GR) Baltimore, MD 21201-1524 (Phone) 410-605-7119 (Fax) 410-605-7913 (Please call phone number above prior to faxing)
In other words it's up to the responders to moderate. The posting guide is our guide. As a last resort we can alert the list administrator (which I believe is the correct name for him in that role), who has powers to remove an email address from the list if he thinks that is appropriate, or act otherwise, or not at all. It is actually up to responders (i.e. all of us) to ensure the posting guide is followed. My view is that the problems started with some responders on some occasions. They sometimes forgot, a little bit, to encourage and remind posters to follow the posting guide when it was not followed. This then may have encouraged more posters to think it was ok not to follow the posting guide. That is my own personal view, not a statistical one backed up by any evidence. Matthew Frank E Harrell Jr f.harr...@vanderbilt.edu wrote in message news:4b913880.9020...@vanderbilt.edu... Gabor Grothendieck wrote: I am happy to answer posts to r-help regardless of the name and email address of the poster but would draw the line at someone excessively posting without a reasonable effort to find the answer first, or using it for homework, since such requests could flood the list making it useless for everyone. Gabor I respectfully disagree. It is bad practice to allow anonymous postings. We
Re: [R] data.table evaluating columns
I'd go a bit further and remind everyone that the r-help posting guide is clear : For questions about functions in standard packages distributed with R (see the FAQ Add-on packages in R), ask questions on R-help. If the question relates to a contributed package, e.g., one downloaded from CRAN, try contacting the package maintainer first. You can also use find("functionname") and packageDescription("packagename") to find this information. ONLY send such questions to R-help or R-devel if you get no reply or need further assistance. This applies to both requests for help and to bug reports. The ONLY is in bold in the posting guide. I changed the bold to capitals above for people reading this in text only. Since Tom and I are friendly and responsive, users of data.table don't usually make it to r-help. We'll follow up this one off-list. Please note that Rob's question is very good by the rest of the posting guide, so no complaints there, only that it was sent to the wrong place. Please keep the questions coming, but send them to us, not r-help. You do sometimes see messages to r-help starting something like "I have contacted the authors/maintainers but didn't hear back, does anyone know..." To not state that they had would be an implicit request for further work by the community (for free) to ask if they had. So it's not enough to contact the maintainer first; you also have to say that you have, and perhaps how long ago too would be helpful. For r-forge projects I usually send any question to everyone on the project (easy to find) or if they have a list then to that. HTH Matthew Tom Short tshort.rli...@gmail.com wrote in message news:fd27013a1003021718w409acb32r1281dfeca5593...@mail.gmail.com... On Tue, Mar 2, 2010 at 7:09 PM, Rob Forler rfor...@uchicago.edu wrote: Hi everyone, I have the following code that works in data frames that I would like to work in data.tables. However, I'm not really sure how to go about it. I basically have the following:

names = c("data1", "data2")
frame = data.frame(list(key1=as.integer(c(1,2,3,4,5,6)), key2=as.integer(c(1,2,3,2,5,6)), data1=c(3,3,2,3,5,2), data2=c(3,3,2,3,5,2)))
for(i in 1:length(names)){
    frame[, paste(names[i], "flag")] = frame[, names[i]] > 3
}

Now I try with data.table code:

names = c("data1", "data2")
frame = data.table(list(key1=as.integer(c(1,2,3,4,5,6)), key2=as.integer(c(1,2,3,2,5,6)), data1=c(3,3,2,3,5,2), data2=c(3,3,2,3,5,2)))
for(i in 1:length(names)){
    frame[, paste(names[i], "flag"), with=F] = as.matrix(frame[, names[i], with=F]) > 3
}

Rob, this type of question is better for the package maintainer(s) directly rather than R-help. That said, one answer is to use list addressing:

for(i in 1:length(names)){
    frame[[paste(names[i], "flag")]] = frame[[names[i]]] > 3
}

Another option is to manipulate frame as a data frame and convert to data.table when you need that functionality (conversion is quick). In the data.table version, frame[, names[i], with=F] is the same as frame[, names[i], drop=FALSE] (the answer is a list, not a vector). Normally, it's easier to use [[]] or $ indexing to get this. Also, frame[i,j] <- something assignment is still a bit buggy for data.tables. - Tom Tom Short __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] data.table evaluating columns
That in itself is a question for the maintainer, off r-help. When the posting guide says "contact the package maintainer first" it means it literally, and it applies even to questions about the existence of a mailing list for the package. So what I'm supposed to do now is tell you how the posting guide works, and tell you that I'll reply off list. Then hopefully the community will be happy with me too. So I'll reply off list :-) Rob Forler rfor...@uchicago.edu wrote in message news:eb472fec1003030502s4996511ap8dfd329a3...@mail.gmail.com... Okay I appreciate the help, and I appreciate the FAQ reminder. I will read the r-help posting guide. I'm relatively new to using the support systems around R. So far everyone has been really helpful. I'm confused as to which data.table list I should be using. http://lists.r-forge.r-project.org/pipermail/datatable-commits/ doesn't appear to be correct. Or just directly sending an email to all of you? Thanks again, Rob On Wed, Mar 3, 2010 at 6:05 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: I'd go a bit further and remind everyone that the r-help posting guide is clear : For questions about functions in standard packages distributed with R (see the FAQ Add-on packages in R), ask questions on R-help. If the question relates to a contributed package, e.g., one downloaded from CRAN, try contacting the package maintainer first. You can also use find("functionname") and packageDescription("packagename") to find this information. ONLY send such questions to R-help or R-devel if you get no reply or need further assistance. This applies to both requests for help and to bug reports. The ONLY is in bold in the posting guide. I changed the bold to capitals above for people reading this in text only. Since Tom and I are friendly and responsive, users of data.table don't usually make it to r-help. We'll follow up this one off-list. Please note that Rob's question is very good by the rest of the posting guide, so no complaints there, only that it was sent to the wrong place. Please keep the questions coming, but send them to us, not r-help. You do sometimes see messages to r-help starting something like "I have contacted the authors/maintainers but didn't hear back, does anyone know..." To not state that they had would be an implicit request for further work by the community (for free) to ask if they had. So it's not enough to contact the maintainer first; you also have to say that you have, and perhaps how long ago too would be helpful. For r-forge projects I usually send any question to everyone on the project (easy to find) or if they have a list then to that. HTH Matthew Tom Short tshort.rli...@gmail.com wrote in message news:fd27013a1003021718w409acb32r1281dfeca5593...@mail.gmail.com... On Tue, Mar 2, 2010 at 7:09 PM, Rob Forler rfor...@uchicago.edu wrote: Hi everyone, I have the following code that works in data frames that I would like to work in data.tables. However, I'm not really sure how to go about it.
I basically have the following:

names = c("data1", "data2")
frame = data.frame(list(key1=as.integer(c(1,2,3,4,5,6)), key2=as.integer(c(1,2,3,2,5,6)), data1=c(3,3,2,3,5,2), data2=c(3,3,2,3,5,2)))
for(i in 1:length(names)){
    frame[, paste(names[i], "flag")] = frame[, names[i]] > 3
}

Now I try with data.table code:

names = c("data1", "data2")
frame = data.table(list(key1=as.integer(c(1,2,3,4,5,6)), key2=as.integer(c(1,2,3,2,5,6)), data1=c(3,3,2,3,5,2), data2=c(3,3,2,3,5,2)))
for(i in 1:length(names)){
    frame[, paste(names[i], "flag"), with=F] = as.matrix(frame[, names[i], with=F]) > 3
}

Rob, this type of question is better for the package maintainer(s) directly rather than R-help. That said, one answer is to use list addressing:

for(i in 1:length(names)){
    frame[[paste(names[i], "flag")]] = frame[[names[i]]] > 3
}

Another option is to manipulate frame as a data frame and convert to data.table when you need that functionality (conversion is quick). In the data.table version, frame[, names[i], with=F] is the same as frame[, names[i], drop=FALSE] (the answer is a list, not a vector). Normally, it's easier to use [[]] or $ indexing to get this. Also, frame[i,j] <- something assignment is still a bit buggy for data.tables. - Tom Tom Short __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
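For readers skimming the archive, here is a self-contained version of the list-addressing fix above. The "> 3" threshold is reconstructed from the thread (the columns being built are boolean flags), so treat the exact cutoff as illustrative rather than authoritative.

library(data.table)
nms   <- c("data1", "data2")
frame <- data.table(key1 = 1:6, key2 = c(1L,2L,3L,2L,5L,6L),
                    data1 = c(3,3,2,3,5,2), data2 = c(3,3,2,3,5,2))
for (nm in nms) {
    # [[<- works the same on data.table and data.frame, no 'with' needed
    frame[[paste(nm, "flag")]] <- frame[[nm]] > 3
}
frame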
Re: [R] Three most useful R package
Dieter, One way to check if a package is active is by looking on r-forge. If you are referring to data.table you would have found it is actually very active at the moment and is far from abandoned. What you may be referring to is a warning, not an error, with v1.2 on R2.10+. That was fixed many moons ago. The r-forge version is where it's at. Rather than commenting in public about a warning on a package, and making a conclusion about its abandonment, and doing this without copying the maintainer, perhaps you could have contacted the maintainer to let him know you had found a problem. That would have been a more community-spirited action to take. Doing that at the time you found out would have been helpful too, rather than saving it up for now. Or you can always check the svn logs yourself, as the r-forge guys even made that trivial to do. All, Can we please now stop this thread ? The crantastic people worked hard to provide a better solution. If the community refuses to use crantastic, that's up to the community, but to start now filling up r-help with votes on packages when so much effort was put into a much, much better solution ages ago? It's as quick to put your votes into crantastic as it is to write to r-help. What's your problem, folks, with crantastic? The second reply mentioned crantastic but you all chose to ignore it, it seems. If you want to vote, use crantastic. If you don't want to vote, don't vote. But using r-help to vote ?! The better solution is right there: http://crantastic.org/ Matthew Dieter Menne dieter.me...@menne-biomed.de wrote in message news:1267626882999-1576618.p...@n4.nabble.com... Rob Forler wrote: And data.table because it does aggregation about 50x times faster than plyr (which I used to use a lot). This is correct; from the error message it spits out one has to conclude that it was abandoned at R-version 2.4.x Dieter -- View this message in context: http://n4.nabble.com/Three-most-useful-R-package-tp1575671p1576618.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reading large files
I agree with Jim. The term "do analysis" is almost meaningless; the posting guide makes reference to statements such as that. At least he tried to define "large", but inconsistently (first of all 850MB, then changed to 10-20-15GB). Satish wrote: "at one time I will need to load say 15GB into R". Assuming the user is always right then, here is some information : R has been 64bit on unix for a very long time (over a decade). 64bit R is also available for Win64. It uses as much RAM as you install on the box, e.g. 64GB. Yes, R users do that, and they've been doing that for years and years. The data.table package was mainly designed for 64bit, although it's a point of consternation when people think that's all it's useful for. If you don't have the hardware, then you can rent the time on EC2. There are tools and packages to make that easy, e.g. pre-built images you can just use. Look at the HPC task view. Search the archives. Don't miss Biocep at http://biocep-distrib.r-forge.r-project.org/doc.html. Albert Einstein said "A clever person solves a problem. A wise person avoids it." So an option for you is to be wise and move to 64bit. jim holtman jholt...@gmail.com wrote in message news:644e1f321002050513y242304der84b5674930b54...@mail.gmail.com... Where should we shine it? No information provided on operating system, version, memory, size of files, what you want to do with them, etc. Lots of options: put it in a database, read partial file (lines and/or columns), preprocess, etc. Your option. On Fri, Feb 5, 2010 at 8:03 AM, Satish Vadlamani satish.vadlam...@fritolay.com wrote: Folks: Can anyone throw some light on this? Thanks. Satish - Satish Vadlamani -- View this message in context: http://n4.nabble.com/Reading-large-files-tp1469691p1470169.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
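As a concrete illustration of Jim's "read partial file" option, this is one standard base-R pattern for processing a file too big to load at once. The file name and chunk size are invented for the example, so adjust to taste; the point is that only the reduced per-chunk result is kept in memory.

con <- file("bigfile.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]   # header line
while (length(lines <- readLines(con, n = 100000)) > 0) {
    tc    <- textConnection(lines)
    chunk <- read.csv(tc, header = FALSE, col.names = hdr)
    close(tc)
    ## aggregate or filter 'chunk' here and keep only the reduced result
}
close(con)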
Re: [R] Reading large files
I can't help you further than what's already been posted to you. Maybe someone else can. Best of luck. Satish Vadlamani satish.vadlam...@fritolay.com wrote in message news:1265397089104-1470667.p...@n4.nabble.com... Matthew: If it is going to help, here is the explanation. I have an end state in mind. It is given below under the "End State" header. In order to get there, I need to start somewhere right? I started with an 850 MB file and could not load it in what I think is reasonable time (I waited for an hour). There are references to 64 bit. How will that help? It is a 4GB RAM machine and there is no paging activity when loading the 850 MB file. I have seen other threads on the same types of questions. I did not see any clear cut answers or errors that I could have been making in the process. If I am missing something, please let me know. Thanks. Satish End State Satish wrote: "at one time I will need to load say 15GB into R" - Satish Vadlamani -- View this message in context: http://n4.nabble.com/Reading-large-files-tp1469691p1470667.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] merging columns
Yes. data.df[,wcol,drop=FALSE] For an explanation of drop see ?"[.data.frame" Chuck White chuckwhi...@charter.net wrote in message news:20100202212800.o8xbu.681696.r...@mp11... Additional clarification: the problem only comes when you have one column selected from the original dataframe. You need to make the following modification to the original example:

data.df <- data.frame(aa=c(1,1,0), cc=c(1,0,0), aab=c(0,1,0), aac=c(0,0,1), bb=c(1,0,1))

And, the following seems to work:

data.frame(sapply(col2.uniq, function(col) {
    wcol <- which(col==col2)
    as.numeric(rowSums(data.frame(data.df[,wcol])) > 0)
}))

I had to wrap data.df[,wcol] in another data.frame to handle situations where wcol had one element. Is there a better approach? Chuck White chuckwhi...@charter.net wrote: Hello -- I am trying to merge columns in a dataframe based on substring matches in colnames. I would appreciate it if somebody can suggest a faster/cleaner approach (e.g. I would have really liked to avoid the if-else piece but rowSums does not like that). Thanks.

data.df <- data.frame(aa=c(1,1,0), bbcc=c(1,0,0), aab=c(0,1,0), aac=c(0,0,1), bbk=c(1,0,1))
col2 <- substr(colnames(data.df),1,2)
col2.uniq <- unique(col2)
names(col2.uniq) <- col2.uniq
data.frame(sapply(col2.uniq, function(col) {
    wcol <- which(col==col2)
    if(length(wcol) > 1) {
        tmp <- rowSums(data.df[,wcol])
    } else {
        tmp <- data.df[,wcol]
    }
    as.numeric(tmp > 0)
}))

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
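A short runnable sketch of the answer above, showing that drop=FALSE makes both the extra data.frame() wrapper and the if/else unnecessary (data taken from Chuck's clarified example):

data.df <- data.frame(aa=c(1,1,0), cc=c(1,0,0), aab=c(0,1,0), aac=c(0,0,1), bb=c(1,0,1))
col2 <- substr(colnames(data.df), 1, 2)
col2.uniq <- unique(col2)
names(col2.uniq) <- col2.uniq
data.frame(sapply(col2.uniq, function(col) {
    wcol <- which(col == col2)
    # drop=FALSE keeps a 1-column data.frame, so rowSums always works
    as.numeric(rowSums(data.df[, wcol, drop=FALSE]) > 0)
}))
#   aa cc bb
# 1  1  1  1
# 2  1  0  0
# 3  1  0  1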
Re: [R] RMySQL - Bulk loading data and creating FK links
How it represents data internally is very important, depending on the real goal : http://en.wikipedia.org/wiki/Column-oriented_DBMS Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com... How it represents data internally should not be important as long as you can do what you want. SQL is declarative so you just specify what you want rather than how to get it, and invisibly to the user it automatically draws up a query plan and then uses that plan to get the result. On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle mdo...@mdowle.plus.com wrote: sqldf("select * from BOD order by Time desc limit 3") Exactly. SQL requires use of order by. It knows the order, but it isn't ordered. That's not good, but might be fine, depending on what the real goal is. Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com... On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: How many columns, and of what type are the columns ? As Olga asked too, it would be useful to know more about what you're really trying to do. 3.5m rows is not actually that many rows, even for 32bit R. It depends on the columns and what you want to do with those columns. At the risk of suggesting something before we know the full facts, one possibility is to load the data from flat file into data.table. Use setkey() to set your keys. Use tables() to summarise your various tables. Then do your joins etc all-in-R. data.table has fast ways to do those sorts of joins (but we need more info about your task). Alternatively, you could check out the sqldf website. There is an sqlread.csv (or similar name, in fact read.csv.sql) which can read your files directly into SQL instead of going via R. Gabor has some nice examples there about that and it's faster. You use some buzzwords which make me think that SQL may not be appropriate for your task though. Can't say for sure (because we don't have enough information) but it's possible you are struggling because SQL has no row ordering concept built in. That might be why you've created an increment field? In the SQLite database it automatically assigns a self incrementing hidden column called rowid to each row, e.g. using SQLite via the sqldf package on CRAN and the BOD data frame which is built into R we can display the rowid column explicitly by referring to it in our select statement:

library(sqldf)
BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8

Do your queries include order by incrementing field? SQL is not good at first and last type logic. An all-in-R solution may well be — In SQLite you can get the top 3 values, say, like this (continuing the prior example):

sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

— better, since R is very good with ordered vectors. A 1GB data.table (or data.frame) for example, at 3.5m rows, could have 76 integer columns, or 38 double columns. 1GB is well within 32bit and allows some space for working copies, depending on what you want to do with the data. If you have 38 or fewer columns, or you have 64bit, then an all-in-R solution *might* get your task done quicker, depending on what your real goal is.
If this sounds plausible, you could post more details and, if it's appropriate, and luck is on your side, someone might even sketch out how to do an all-in-R solution. Nathan S. Watson-Haigh nathan.watson-ha...@csiro.au wrote in message news:4b5fde1b.10...@csiro.au... I have a table (contact) with several fields and its PK is an auto increment field. I'm bulk loading data to this table from files which if successful will be about 3.5 million rows (approx 16000 rows per file). However, I have a linking table (an_contact) to resolve a m:m relationship between the an and contact tables. How can I retrieve the PKs for the data bulk loaded into contact so I can insert the relevant data into an_contact? I currently load the data into contact using: dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE) But I then need to get all the PKs which this dbWriteTable() appended to the contact table so I can load the data into my an_contact link table. I don't want to issue a separate INSERT query for each row in dat and then use MySQL's LAST_INSERT_ID() function... not when I have 3.5 million rows to insert! Any pointers welcome, Nathan -- Dr. Nathan S. Watson-Haigh OCE Post Doctoral Fellow CSIRO Livestock Industries University Drive Townsville, QLD 4810 Australia Tel: +61 (0)7 4753 8548 Fax: +61 (0)7 4753 8600 Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
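Since the thread keeps gesturing at the all-in-R route, here is a minimal hedged sketch of what "load into data.table, setkey(), then join" looks like. The table and column names are invented stand-ins for Nathan's contact/an_contact schema, not his actual data; the point is that because you assign the keys in R yourself, there is no hidden auto-increment PK to retrieve afterwards.

library(data.table)
contact    <- data.table(name = c("a","b","c","d"), pk = 1:4)  # pk known: we assigned it
an_contact <- data.table(name = c("b","d","d"), an = c(10L, 20L, 30L))
setkey(contact, name)
setkey(an_contact, name)
tables()             # summarise the tables in memory
contact[an_contact]  # keyed join: look up each an_contact row in contact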
Re: [R] RMySQL - Bulk loading data and creating FK links
Are you claiming that SQL is that utopia? SQL is a row store. It cannot give the user the benefits of column store. For example, why does SQL take 113 seconds in the example in this thread : http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html but data.table takes 5 seconds to get the same result ? How come the high level language SQL doesn't appear to hide the user from this detail ? If you are just describing utopia, then of course I agree. It would be great to have a language which hid us from this. In the meantime the user has choices, and the best choice depends on the task and the real goal. Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com... It's only important internally. Externally it's undesirable that the user has to get involved in it. The idea of making software easy to write and use is to hide the implementation and focus on the problem. That is why we use high level languages, object orientation, etc. On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: How it represents data internally is very important, depending on the real goal : http://en.wikipedia.org/wiki/Column-oriented_DBMS Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com... How it represents data internally should not be important as long as you can do what you want. SQL is declarative so you just specify what you want rather than how to get it, and invisibly to the user it automatically draws up a query plan and then uses that plan to get the result. On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle mdo...@mdowle.plus.com wrote: sqldf("select * from BOD order by Time desc limit 3") Exactly. SQL requires use of order by. It knows the order, but it isn't ordered. That's not good, but might be fine, depending on what the real goal is. Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com... On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: How many columns, and of what type are the columns ? As Olga asked too, it would be useful to know more about what you're really trying to do. 3.5m rows is not actually that many rows, even for 32bit R. It depends on the columns and what you want to do with those columns. At the risk of suggesting something before we know the full facts, one possibility is to load the data from flat file into data.table. Use setkey() to set your keys. Use tables() to summarise your various tables. Then do your joins etc all-in-R. data.table has fast ways to do those sorts of joins (but we need more info about your task). Alternatively, you could check out the sqldf website. There is an sqlread.csv (or similar name, in fact read.csv.sql) which can read your files directly into SQL instead of going via R. Gabor has some nice examples there about that and it's faster. You use some buzzwords which make me think that SQL may not be appropriate for your task though. Can't say for sure (because we don't have enough information) but it's possible you are struggling because SQL has no row ordering concept built in. That might be why you've created an increment field? In the SQLite database it automatically assigns a self incrementing hidden column called rowid to each row, e.g.
using SQLite via the sqldf package on CRAN and the BOD data frame which is built into R we can display the rowid column explicitly by referring to it in our select statement:

library(sqldf)
BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8

Do your queries include order by incrementing field? SQL is not good at first and last type logic. An all-in-R solution may well be — In SQLite you can get the top 3 values, say, like this (continuing the prior example):

sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

— better, since R is very good with ordered vectors. A 1GB data.table (or data.frame) for example, at 3.5m rows, could have 76 integer columns, or 38 double columns. 1GB is well within 32bit and allows some space for working copies, depending on what you want to do with the data. If you have 38 or fewer columns, or you have 64bit, then an all-in-R solution *might* get your task done quicker, depending on what your real goal is. If this sounds plausible, you could post more details and, if it's appropriate, and luck is on your side, someone might even sketch out how to do an all-in-R solution. Nathan S. Watson-Haigh nathan.watson-ha...@csiro.au wrote in message news:4b5fde1b.10...@csiro.au... I have a table (contact) with several fields and its PK is an auto increment field. I'm bulk loading data
Re: [R] RMySQL - Bulk loading data and creating FK links
I'm talking about ease of use too. The first line of the Details section in ?[.data.table says : "Builds on base R functionality to reduce 2 types of time : 1. programming time (easier to write, read, debug and maintain) 2. compute time". Once again, I am merely saying that the user has choices, and the best choice (and there are many choices including plyr, and lots of other great packages and base methods) depends on the task and the real goal. This choice is not restricted to compute time only, as you seem to suggest. In fact I listed programming time first (i.e. ease of use). To answer your points : This is the SQL code you posted and I used in the comparison. Notice it's quite long, repeats the text var1,var2,var3 4 times, contains two 'select's and a 'using'.

system.time(sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3) where dt - mindt < 7"))
   user  system elapsed
 103.13    2.17  106.23

Isolating the series of operations you described :

system.time(sqldf("select * from a"))
   user  system elapsed
  39.00    0.63   39.62

So that's roughly 40% of the time. What's happening in the remaining 66 secs? Here's a repeat of the equivalent in data.table :

system.time({adt <- data.table(a)})
   user  system elapsed
   0.90    0.13    1.03
system.time(adt[, list(dt=dt[dt-min(dt) < 7]), by="var1,var2,var3"])   # is that so hard to use compared to the SQL above ?
   user  system elapsed
   3.92    0.78    4.71

I looked at the news section, but I didn't find the benchmarks quickly or easily. The links I saw took me to the FAQs. Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001280855i1d5f7c03v46f7a3e58ff93...@mail.gmail.com... I think one would only be concerned about such internals if one were primarily interested in performance; otherwise, one would be more interested in ease of specification, and part of that ease is having it independent of implementation and separating implementation from specification activities. An example of separation of specification and implementation is that by simply specifying a disk-based database rather than an in-memory database SQL can perform queries that take more space than memory. The query itself need not be modified. I think the viewpoint you are discussing is primarily one of performance whereas the viewpoint I was discussing is primarily ease of use, and that accounts for the difference. I believe your performance comparison is comparing a sequence of operations that includes building a database, transferring data to it, performing the operation, reading it back in and destroying the database, to an internal manipulation. I would expect the internal manipulation, particularly one done primarily in C code as is the case with data.table, to be faster, although some benchmarks of the database approach found that it compared surprisingly well to straight R code -- some users of sqldf found that for an 8000 row data frame sqldf actually ran faster than aggregate and also faster than tapply. The News section on the sqldf home page provides links to their benchmarks. Thus if R is fast enough then it's likely that the database approach is fast enough too since it's even faster. On Thu, Jan 28, 2010 at 8:52 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: Are you claiming that SQL is that utopia? SQL is a row store. It cannot give the user the benefits of column store.
For example, why does SQL take 113 seconds in the example in this thread : http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html but data.table takes 5 seconds to get the same result ? How come the high level language SQL doesn't appear to hide the user from this detail ? If you are just describing utopia, then of course I agree. It would be great to have a language which hid us from this. In the meantime the user has choices, and the best choice depends on the task and the real goal. Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com... It's only important internally. Externally it's undesirable that the user has to get involved in it. The idea of making software easy to write and use is to hide the implementation and focus on the problem. That is why we use high level languages, object orientation, etc. On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: How it represents data internally is very important, depending on the real goal : http://en.wikipedia.org/wiki/Column-oriented_DBMS Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com... How it represents data internally should not be important as long as you can do what you want. SQL is declarative so you just specify what you want rather than how to get it, and invisibly to the user it automatically draws
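To make Gabor's specification-vs-implementation point concrete: in sqldf the same query text can be run against a disk-based database instead of the default in-memory one by supplying the dbname argument, leaving the query itself unchanged. A small illustration using the built-in BOD data, not code from the thread:

library(sqldf)
sqldf("select * from BOD order by Time desc limit 3")                       # in-memory SQLite
sqldf("select * from BOD order by Time desc limit 3", dbname = tempfile())  # same query, disk-based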
Re: [R] RMySQL - Bulk loading data and creating FK links
sqldf("select * from BOD order by Time desc limit 3") Exactly. SQL requires use of order by. It knows the order, but it isn't ordered. That's not good, but might be fine, depending on what the real goal is. Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com... On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle mdo...@mdowle.plus.com wrote: How many columns, and of what type are the columns ? As Olga asked too, it would be useful to know more about what you're really trying to do. 3.5m rows is not actually that many rows, even for 32bit R. It depends on the columns and what you want to do with those columns. At the risk of suggesting something before we know the full facts, one possibility is to load the data from flat file into data.table. Use setkey() to set your keys. Use tables() to summarise your various tables. Then do your joins etc all-in-R. data.table has fast ways to do those sorts of joins (but we need more info about your task). Alternatively, you could check out the sqldf website. There is an sqlread.csv (or similar name, in fact read.csv.sql) which can read your files directly into SQL instead of going via R. Gabor has some nice examples there about that and it's faster. You use some buzzwords which make me think that SQL may not be appropriate for your task though. Can't say for sure (because we don't have enough information) but it's possible you are struggling because SQL has no row ordering concept built in. That might be why you've created an increment field? In the SQLite database it automatically assigns a self incrementing hidden column called rowid to each row, e.g. using SQLite via the sqldf package on CRAN and the BOD data frame which is built into R we can display the rowid column explicitly by referring to it in our select statement:

library(sqldf)
BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8

Do your queries include order by incrementing field? SQL is not good at first and last type logic. An all-in-R solution may well be — In SQLite you can get the top 3 values, say, like this (continuing the prior example):

sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

— better, since R is very good with ordered vectors. A 1GB data.table (or data.frame) for example, at 3.5m rows, could have 76 integer columns, or 38 double columns. 1GB is well within 32bit and allows some space for working copies, depending on what you want to do with the data. If you have 38 or fewer columns, or you have 64bit, then an all-in-R solution *might* get your task done quicker, depending on what your real goal is. If this sounds plausible, you could post more details and, if it's appropriate, and luck is on your side, someone might even sketch out how to do an all-in-R solution. Nathan S. Watson-Haigh nathan.watson-ha...@csiro.au wrote in message news:4b5fde1b.10...@csiro.au... I have a table (contact) with several fields and its PK is an auto increment field. I'm bulk loading data to this table from files which if successful will be about 3.5 million rows (approx 16000 rows per file). However, I have a linking table (an_contact) to resolve a m:m relationship between the an and contact tables. How can I retrieve the PKs for the data bulk loaded into contact so I can insert the relevant data into an_contact?
I currently load the data into contact using: dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE) But I then need to get all the PKs which this dbWriteTable() appended to the contact table so I can load the data into my an_contact link table. I don't want to issue a separate INSERT query for each row in dat and then use MySQL's LAST_INSERT_ID() function... not when I have 3.5 million rows to insert! Any pointers welcome, Nathan -- Dr. Nathan S. Watson-Haigh OCE Post Doctoral Fellow CSIRO Livestock Industries University Drive Townsville, QLD 4810 Australia Tel: +61 (0)7 4753 8548 Fax: +61 (0)7 4753 8600 Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Once again: Error: cannot allocate vector of size
Please re-read the posting guide, e.g. you didn't provide an example data set or a way to generate one, or any R version information. Werner W. pensterfuz...@yahoo.de wrote in message news:646146.32238...@web23002.mail.ird.yahoo.com... Hi, I have browsed the help list and looked at the FAQ but I don't find conclusive evidence if this is normal or I am doing something wrong. I am running a lm() on a data.frame with 27136 observations of 6 variables (3 num and 3 factor). After a while R throws this:

lm(log(y) ~ log(a) + log(b) + c + d + e, data=reg.data, na.action=na.exclude)
Error: cannot allocate vector of size 203.7 MB

This is a Windows XP 32 bit machine with 4 GB in it so that theoretically, R should be able to claim close to 2 GB. This is the gc() after the regression:

          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  272299  7.3     875833  23.4   1368491   36.6
Vcells 4526037 34.6  116536251 889.2 145524997 1110.3

memory.size(max=T)
[1] 1230.25
memory.size(max=F)
[1] 47.89

Looking at memory.size, R should be easily able to allocate that space, shouldn't it? Many thanks for any hints! --Werner __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Merging and extracting data from list
?merge, plyr, data.table, sqldf, crantastic Dr. Viviana Menzel vivianamen...@gmx.de wrote in message news:4b58a0e9.3050...@gmx.de... Hello R-help group, I have a question about merging lists. I have two lists: Genes list (hSgenes), with columns name, chr, strand, start, end, transStart, transEnd, symbol, description, feature: ENSG02239721111874144121187414412 DEAD/H box polypeptide 11 like 1DEAD/H box polypeptide 11 like 3DEAD/H box polypeptide 11 like 9 ;; [Source:UniProtKB/TrEMBL;Acc:B7ZGX0]gene ENSG02272321-114363295701755129343 WASH5PWAS protein family homolog 5 pseudogene (WASH5P), non-coding RNA [Source:RefSeq DNA;Acc:NR_024540]gene . Chers list (chersList), with columns name, chr, start, end, cellType, antibody, features, maxLevel, score: chr1.cher11859132859732humanABENSG0223764 ENSG0231958 ENSG01876341.257360389683160.664381383074449 chr1.cher21889564890464humanABENSG0188976 1.478842336320642.88839131446868 chr1.cher3111063641106864humanAB ENSG01625711.837956544181153.58404359147275 In the second list, I want to add a column with the gene description (obtained from the first list). I used the following method:

chersMergeGenes <- data.frame(chersList, description=hSgenes$description[match(chersList$features, hSgenes$name)], symbol=hSgenes$symbol[match(chersList$features, hSgenes$name)])
write.table(chersMergeGenes, row.names=F, quote=F, sep="\t", file="chersMergeGenes.txt")

and it works only partially. When chersList$features contains more than one feature (e.g. ENSG0223764 ENSG0231958 ENSG0187634), it doesn't work (NA as result). But I don't know how to split the features to obtain all descriptions. Can someone give me a hint to do this? Another problem: I have following data: $ENSG003 [1] GO:0043123 GO:0004871 $ENSG419 [1] GO:0018406 GO:0035269 GO:0006506 GO:0019348 GO:0005789 [6] GO:0005624 GO:0005783 GO:0033185 GO:0004582 GO:0004169 [11] GO:0005515 $ENSG457 [1] GO:0005737 GO:0030027 GO:0005794 GO:0005515 I want to extract a list of names ($ENSG0?) where go = GO:0005515. How can I do it? Thanks in advance Viviana -- ~~~ Dr. Viviana Menzel Rottweg 34 35428 Langgöns Tel.: +49 6403 7748550 Mobil: +49 177 5126092 E-Mail: vivianamen...@gmx.de Web: www.dres-menzel.de __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
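Neither question in the post is hard once the multi-feature cells are split. Here is a hedged, self-contained sketch with invented miniature stand-ins for hSgenes and chersList (the real tables are much wider), plus the GO lookup for the second question:

hSgenes <- data.frame(name = c("ENSG01", "ENSG02", "ENSG03"),
                      description = c("desc1", "desc2", "desc3"),
                      stringsAsFactors = FALSE)
chersList <- data.frame(features = c("ENSG01 ENSG03", "ENSG02"),
                        stringsAsFactors = FALSE)
## split each space-separated features cell, match every id, then re-join
feats <- strsplit(chersList$features, " ", fixed = TRUE)
chersList$description <- sapply(feats, function(f)
    paste(hSgenes$description[match(f, hSgenes$name)], collapse = "; "))

## second question: element names whose GO vector contains GO:0005515
golist <- list(ENSG003 = c("GO:0043123", "GO:0004871"),
               ENSG457 = c("GO:0005737", "GO:0005515"))
names(golist)[sapply(golist, function(g) "GO:0005515" %in% g)]
# [1] "ENSG457"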
Re: [R] loop on list levels and names
Great. If you mean the crantastic r package, sorry I wasn't clear, I meant the crantastic website http://crantastic.org/. If you meant the description of plyr, then if the description looks useful, click the link taking you to the package documentation and read it. Same for any of the other packages. The idea, I think, is that it's a good idea to make yourself aware of the most popular packages, i.e. perhaps just read the descriptions of the top 30 or something like that maybe. Maybe it helps you avoid re-inventing the wheel. That seems to be the case here. Re Don's reply, sure you can use split(). But that will use more memory. And using paste for this? Ok, it works, but don't you want to use better ways? data.table should be much faster and more convenient, quicker to write than split and paste like that. HTH Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4b59bdc5.60...@uni-hamburg.de... I didn't know about crantastic actually. I've looked at what it is exactly and it indeed looks interesting, but I don't really see how I would know that it would help me for the task. There's a description of what it was built for, but how can I then know which function from this package can help me? Thanks for your answer (you all), I'll work on it! I'll keep you informed if it doesn't work (!), and I'll go vote on crantastic when I have a bit more experience with the packages I use (right now I'm just using the ones I was told about for one specific function), but don't worry I won't forget. As you said, "It only works if users contribute to it". That's the power of R! Ivan On 1/21/2010 19:01, Matthew Dowle wrote : One way is : dataset = data.table(ssfamed) dataset[, whatever some functions are on Asfc, Smc, epLsar, etc, by="SPECSHOR,BONE"] Your SPECSHOR and BONE names will be in your result alongside the results of the whatever ... Or try package plyr which does this sort of thing too. And sqldf may be better if you know SQL and prefer it. There are actually zillions of ways to do it : by(), doBy() etc etc. If you get your code working the way it's constructed currently, it's going to be very slow, because of those ==. data.table doesn't do that and is pretty fast for this kind of thing. You might find that plyr is easier to use and more flexible though if speed isn't an issue, depending on exactly what you want to do. Whichever way you decide, consider voting on crantastic for the package you end up using, and that may be a quick and easy way for you to help new R users in the future, and help us all by reducing the r-help traffic on the same subject over and over again. Note that plyr is the 2nd spot on crantastic; it would have solved your problem without needing to write that code. If you check crantastic first and make sure you're aware of popular packages, it might avoid getting stuck in this way again. It only works if users contribute to it though. Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4b587cdd.4070...@uni-hamburg.de... Hi everybody!
To use some functions, I have to transform my dataset into a list, where each element contains one group, and I have to prepare a list for each variable I have (altogether I have 15 variables, and many entries per factor level). Here is some part of my dataset:

SPECSHOR  BONE  Asfc        Smc        epLsar
cotau     tx    454.390369  29.261638  0.001136
cotau     tx    117.445711   4.291884  0.00056
cotau     tx    381.024682  15.313017  0.002324
cotau     tx    159.081789  18.134533  0.000462
cotau     tm    160.641503   6.411332  0.000571
cotau     tm     79.238023   3.828254  0.001182
cotau     tm    143.20655   11.921899  0.000192
cotau     tm    115.476996  33.116386  0.000417
cotau     tm    594.256234  72.538131  0.000477
eqgre     tx    188.261324   8.279096  0.000777
eqgre     tx    152.444216   2.596325  0.001022
eqgre     tx    256.601507   8.279096  0.000566
eqgre     tx    250.816445  18.134533  0.000535
eqgre     tx    272.396711  24.492879  0.000585
eqgre     tm    172.63264    4.291884  0.001781
eqgre     tm    189.441097  14.425498  0.001347
eqgre     tm    170.743788  13.564472  0.000602
eqgre     tm    158.960849  10.385299  0.001189
eqgre     tm     80.972408   3.828254  0.000644
gicam     tx    294.494001   9.656738  0.000524
gicam     tx    267.126765  19.128024  0.000647
gicam     tx     81.888658   4.782006  0.000492
gicam     tx    168.32908   12.729939  0.001097
gicam     tx    123.296056   7.007427  0.000659
gicam     tm     94.264887  18.134533  0.000752
gicam     tm     54.317395   3.828254  0.00038
gicam     tm     55.978883  17.167534  0.000141
gicam     tm    279.597993  15.313017  0.000398
gicam     tm    288.262556  18.134533  0.001043

What I do next is: list_Asfc <- list() list_Asfc[[1]] <- ssfamed[ssfamed$SPECSHOR
Re: [R] Once again: Error: cannot allocate vector of size
Fantastic. You're much more likely to get a response now. Best of luck. werner w pensterfuz...@yahoo.de wrote in message news:1264175935970-1100164.p...@n4.nabble.com... Thanks Matthew, you are absolutely right. I am working on Windows XP SP2 32bit with R version 2.9.1. Here is an example:

d <- as.data.frame(matrix(trunc(rnorm(6*27136, 1, 100)), ncol=6))
d[,4:5] <- trunc(100*runif(2*27136, 0, 1))
d[,6] <- trunc(1000*runif(27136, 0, 1))
for (i in 4:6) d[,i] <- as.factor(d[,i])
lm(V1 ~ log(V2) + log(V3) + V4 + V5 + V6, data=d)
memory.size(max=F)
memory.size(max=T)

I managed to get it to run through after setting the 3GB switch for Windows and with a clean R session. I also noticed later that after removing na.action=na.exclude more regressions run through. But before and after the lm() it seems there should be enough memory, which means that lm() builds up some quite large objects during its computations? -- View this message in context: http://n4.nabble.com/Once-again-Error-cannot-allocate-vector-of-size-tp1083506p1100164.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
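Not from the thread, but relevant to Werner's closing question: lm() does keep large components (the model frame in particular) in its result, and you can ask it not to. Note this only trims what is returned; the peak allocation during fitting is dominated by the model matrix itself, and with factors of hundreds of levels a 27136-row double matrix with ~1000 dummy columns is already roughly 200 MB, which plausibly matches the reported error. A hedged sketch using his example data d:

fit <- lm(V1 ~ log(V2) + log(V3) + V4 + V5 + V6, data = d,
          model = FALSE, x = FALSE, y = FALSE)  # don't keep model frame / matrices
object.size(fit)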
Re: [R] loop on list levels and names
data.table is the package name too. Make sure you find ?"[.data.table" which is linked from ?data.table. You could just do a mean of one variable first, and then build it up from there, e.g. dataset[, mean(epLsar), by="SPECSHOR,BONE"]. To get multiple columns of output, wrap with DT() like this: dataset[, DT(mean(epLsar), min(epLsar)), by="SPECSHOR,BONE"] Btw, v1.3 on r-forge fixes a version check warning with v1.2 on R2.10+ (not fixed by me but thanks to a contributor), so if you can't live with the warning messages, you can install v1.3 from r-forge like this : install.packages("data.table", repos="http://r-forge.r-project.org") Best of luck. Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4b59d93c.5080...@uni-hamburg.de... Thanks for your advice, I will work on it then! Just one last question: in which package can I find the function data.table? Ivan On 1/22/2010 17:18, Matthew Dowle wrote : Great. If you mean the crantastic r package, sorry I wasn't clear, I meant the crantastic website http://crantastic.org/. If you meant the description of plyr, then if the description looks useful, click the link taking you to the package documentation and read it. Same for any of the other packages. The idea, I think, is that it's a good idea to make yourself aware of the most popular packages, i.e. perhaps just read the descriptions of the top 30 or something like that maybe. Maybe it helps you avoid re-inventing the wheel. That seems to be the case here. Re Don's reply, sure you can use split(). But that will use more memory. And using paste for this? Ok, it works, but don't you want to use better ways? data.table should be much faster and more convenient, quicker to write than split and paste like that. HTH Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4b59bdc5.60...@uni-hamburg.de... I didn't know about crantastic actually. I've looked at what it is exactly and it indeed looks interesting, but I don't really see how I would know that it would help me for the task. There's a description of what it was built for, but how can I then know which function from this package can help me? Thanks for your answer (you all), I'll work on it! I'll keep you informed if it doesn't work (!), and I'll go vote on crantastic when I have a bit more experience with the packages I use (right now I'm just using the ones I was told about for one specific function), but don't worry I won't forget. As you said, "It only works if users contribute to it". That's the power of R! Ivan On 1/21/2010 19:01, Matthew Dowle wrote : One way is : dataset = data.table(ssfamed) dataset[, whatever some functions are on Asfc, Smc, epLsar, etc, by="SPECSHOR,BONE"] Your SPECSHOR and BONE names will be in your result alongside the results of the whatever ... Or try package plyr which does this sort of thing too. And sqldf may be better if you know SQL and prefer it. There are actually zillions of ways to do it : by(), doBy() etc etc. If you get your code working the way it's constructed currently, it's going to be very slow, because of those ==. data.table doesn't do that and is pretty fast for this kind of thing. You might find that plyr is easier to use and more flexible though if speed isn't an issue, depending on exactly what you want to do. Whichever way you decide, consider voting on crantastic for the package you end up using, and that may be a quick and easy way for you to help new R users in the future, and help us all by reducing the r-help traffic on the same subject over and over again.
Note that plyr is the 2nd spot on crantastic; it would have solved your problem without needing to write that code. If you check crantastic first and make sure you're aware of popular packages, it might avoid getting stuck in this way again. It only works if users contribute to it though. Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4b587cdd.4070...@uni-hamburg.de... Hi everybody! To use some functions, I have to transform my dataset into a list, where each element contains one group, and I have to prepare a list for each variable I have (altogether I have 15 variables, and many entries per factor level). Here is some part of my dataset:

SPECSHOR  BONE  Asfc        Smc        epLsar
cotau     tx    454.390369  29.261638  0.001136
cotau     tx    117.445711   4.291884  0.00056
cotau     tx    381.024682  15.313017  0.002324
cotau     tx    159.081789  18.134533  0.000462
cotau     tm    160.641503   6.411332  0.000571
cotau     tm     79.238023   3.828254  0.001182
cotau     tm    143.20655   11.921899  0.000192
cotau     tm    115.476996  33.116386  0.000417
cotau     tm    594.256234  72.538131  0.000477
eqgre     tx    188.261324   8.279096  0.000777
eqgre     tx    152.444216   2.596325  0.001022
eqgre     tx    256.601507   8.279096  0.000566
eqgre
Re: [R] loop on list levels and names
One way is : dataset = data.table(ssfamed) dataset[, whatever some functions are on Asfc, Smc, epLsar, etc, by="SPECSHOR,BONE"] Your SPECSHOR and BONE names will be in your result alongside the results of the whatever ... Or try package plyr which does this sort of thing too. And sqldf may be better if you know SQL and prefer it. There are actually zillions of ways to do it : by(), doBy() etc etc. If you get your code working the way it's constructed currently, it's going to be very slow, because of those ==. data.table doesn't do that and is pretty fast for this kind of thing. You might find that plyr is easier to use and more flexible though if speed isn't an issue, depending on exactly what you want to do. Whichever way you decide, consider voting on crantastic for the package you end up using, and that may be a quick and easy way for you to help new R users in the future, and help us all by reducing the r-help traffic on the same subject over and over again. Note that plyr is the 2nd spot on crantastic; it would have solved your problem without needing to write that code. If you check crantastic first and make sure you're aware of popular packages, it might avoid getting stuck in this way again. It only works if users contribute to it though. Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message news:4b587cdd.4070...@uni-hamburg.de... Hi everybody! To use some functions, I have to transform my dataset into a list, where each element contains one group, and I have to prepare a list for each variable I have (altogether I have 15 variables, and many entries per factor level). Here is some part of my dataset:

SPECSHOR  BONE  Asfc        Smc        epLsar
cotau     tx    454.390369  29.261638  0.001136
cotau     tx    117.445711   4.291884  0.00056
cotau     tx    381.024682  15.313017  0.002324
cotau     tx    159.081789  18.134533  0.000462
cotau     tm    160.641503   6.411332  0.000571
cotau     tm     79.238023   3.828254  0.001182
cotau     tm    143.20655   11.921899  0.000192
cotau     tm    115.476996  33.116386  0.000417
cotau     tm    594.256234  72.538131  0.000477
eqgre     tx    188.261324   8.279096  0.000777
eqgre     tx    152.444216   2.596325  0.001022
eqgre     tx    256.601507   8.279096  0.000566
eqgre     tx    250.816445  18.134533  0.000535
eqgre     tx    272.396711  24.492879  0.000585
eqgre     tm    172.63264    4.291884  0.001781
eqgre     tm    189.441097  14.425498  0.001347
eqgre     tm    170.743788  13.564472  0.000602
eqgre     tm    158.960849  10.385299  0.001189
eqgre     tm     80.972408   3.828254  0.000644
gicam     tx    294.494001   9.656738  0.000524
gicam     tx    267.126765  19.128024  0.000647
gicam     tx     81.888658   4.782006  0.000492
gicam     tx    168.32908   12.729939  0.001097
gicam     tx    123.296056   7.007427  0.000659
gicam     tm     94.264887  18.134533  0.000752
gicam     tm     54.317395   3.828254  0.00038
gicam     tm     55.978883  17.167534  0.000141
gicam     tm    279.597993  15.313017  0.000398
gicam     tm    288.262556  18.134533  0.001043

What I do next is:

list_Asfc <- list()
list_Asfc[[1]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tx', 3]
list_Asfc[[2]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tm', 3]

And so on for each level of SPECSHOR and BONE. I'm stuck on 2 parts: - in a loop or something similar, I would like the 1st element of the list to be filled by the values for the 1st variable with the first level of my factors (i.e. cotau + tx), and then the 2nd element with the 2nd level (i.e. cotau + tm) and so on. As shown above, I know how to do it if I enter manually the different levels, but I have no idea which function I should use so that each combination of factors will be used. See what I mean? - I would then like to run it in a loop or something for each variable. It is by itself not so complicated, but I don't know how to give the correct name to my list.
I want the list containing the data for Asfc to be named list_Asfc. Here is what I tried:

seq.num <- c(seq(3,5,1))  # the indexes of the variables
for(i in 1:length(seq.num)) {
    k <- seq.num[i]
    name.num <- names(ssfamed)[k]
    list <- list()
    list[[1]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tx', i]
    list[[2]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tm', i]
    names(list) <- c("cotau_tx", "cotau_tm")  # I have more and the 1st question should help me on that too
}

After names(list) I need to insert something like: name_list <- list But I don't know how to give it the correct name. How do we change the name of an object? Or am I on the wrong path? Thank you in advance for your help. Ivan PS: if necessary: under Windows XP, R2.10. [[alternative HTML version deleted]] __
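To make the suggestions in this thread concrete, a hedged sketch using a tiny invented stand-in for ssfamed, showing both the data.table grouping route and the assign() answer to Ivan's naming question. It is written with list() in j, the form current data.table accepts; the thread's DT() wrapper played the same role in the version discussed above.

library(data.table)
ssfamed <- data.frame(SPECSHOR = rep(c("cotau","eqgre"), each = 4),
                      BONE     = rep(c("tx","tm"), times = 4),
                      Asfc = runif(8, 50, 500), Smc = runif(8, 2, 70),
                      epLsar = runif(8, 1e-4, 2e-3))
dataset <- data.table(ssfamed)
dataset[, list(Asfc = mean(Asfc), Smc = mean(Smc)), by = "SPECSHOR,BONE"]

## the naming question: build the name as a string, then assign() it;
## split() on both factors does the manual list-building in one call
name.num <- "Asfc"
assign(paste("list", name.num, sep = "_"),
       split(ssfamed[[name.num]], list(ssfamed$SPECSHOR, ssfamed$BONE)))
list_Asfc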
Re: [R] Mutliple sets of data in one dataset....Need a loop?
"but I have thousands of results so it would be really handy to find a way of doing this quickly" "it's a little difficult to follow those examples" Given your data in data.frame DF, maybe add the following to your list to investigate :

dat = data.table(DF)
dat[, cor(Score1,Score2), by="Experiment"]
     Experiment         V1
[1,]          X  0.9889524
[2,]          Y  0.3041195
[3,]          Z -0.1346107

To do a plot instead, just replace cor with plot or whatever else you want to do within each group. Since you said you have thousands of results, data.table is faster for that. In terms of ease of use, you could try plyr too, which you may well prefer. "those examples all seem so different" If you look and search crantastic, users are putting their comments there. That might help you make a decision more quickly and avoid you needing to post to r-help and wait for a reply, assuming there is a package that already does what you need. Searching the history of r-help would have found many solutions to your problem this time, but it seems you are looking for advice on the best way. This changes over time and depends on lots of factors, including what you really want to do. Once you have worked out which packages work best for you, put your votes/comments onto crantastic and it should help everyone who follows in your path. I guess you should then update your votes/comments as time progresses too. Btw, plyr is ranked #2 on crantastic and is designed specifically for your task !! Making yourself aware of the most popular packages would have helped you. If you need speed try data.table. When it comes to current, up to date advice on the most appropriate package, crantastic could be fantastic, assuming of course that you, the user, contribute to it. HTH BioStudent s0975...@sms.ed.ac.uk wrote in message news:1264072645590-1049653.p...@n4.nabble.com... Hi, thanks for all your help. It's a little difficult to follow those examples as they all seem so different, and it's hard to see how I do what I want to my data from the help files, but I'll try... -- View this message in context: http://n4.nabble.com/Mutliple-sets-of-data-in-one-dataset-Need-a-loop-tp1018503p1049653.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
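For comparison with the one-liner above, the same per-group correlation in base R alone, on invented toy data (so the numbers will not match the thread's output):

DF <- data.frame(Experiment = rep(c("X","Y","Z"), each = 10),
                 Score1 = rnorm(30), Score2 = rnorm(30))
# split into one data.frame per Experiment, then correlate within each
sapply(split(DF, DF$Experiment), function(d) cor(d$Score1, d$Score2))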
Re: [R] problem of data manipulation
The user wrote in their first post : "I have a lot of observations in my dataset". Here's one way to do it with a data.table :

    a = data.table(a)
    ans = a[ , list(dt=dt[dt-min(dt) < 7]) , by="var1,var2,var3"]
    class(ans$dt) = "Date"

Timings are below comparing the 3 methods. In this example, data.table appears to be 28 times faster than plyr, and 24 times faster than sqldf. I excluded the one-off time to build the key, since that's realistic (the key is built once and reused), but even including that time, data.table is still about 17 times faster than plyr (134.80 / (1.03 + 2.16 + 4.71)). With even more rows, the speedups should be even bigger.

    a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L),
        .Label = c("c", "n", "s"), class = "factor"),
        var2 = c(1L, 1L, 1L, 2L, 2L, 2L), var3 = c(2L, 2L, 2L, 1L, 1L, 1L),
        dt = structure(c(10592, 10997, 11000, 10998, 11002, 11010), class = "Date")),
        .Names = c("var1", "var2", "var3", "dt"), row.names = c(NA, -6L),
        class = "data.frame")
    a = data.frame(lapply(a, function(x) rep(x, each=100)))
    dim(a)
    [1] 600 4

    library(plyr)
    system.time({ans1 <- ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)})
       user  system elapsed
     131.39    3.11  134.80

    library(sqldf)
    system.time({ans2 <- sqldf("select var1, var2, var3, dt from a,
        (select var1, var2, var3, min(dt) mindt from a group by var1, var2, var3)
        using(var1, var2, var3) where dt - mindt < 7")})
       user  system elapsed
     110.26    2.24  113.32

    mapply(identical, ans1, ans2[order(ans2$var1),])
    var1 var2 var3   dt
    TRUE TRUE TRUE TRUE

    library(data.table)
    system.time({adt <- data.table(a)})
       user  system elapsed
       0.90    0.13    1.03
    system.time({setkey(adt, var1, var2, var3)})
       user  system elapsed
       1.89    0.27    2.16
    system.time({ans3 <- adt[, list(dt=dt[dt-min(dt) < 7]), by="var1,var2,var3"]})
       user  system elapsed
       3.92    0.78    4.71
    class(ans3$dt) = "Date"
    mapply(identical, ans1, ans3)
    var1 var2 var3   dt
    TRUE TRUE TRUE TRUE

Note that in the documentation ?[.data.table, where I say that 'by' is slow, I mean relative to how fast it could be. It seems, in this specific example anyway, and with the code posted so far, to be significantly faster than sqldf and plyr.

Gabor Grothendieck ggrothendi...@gmail.com wrote in message news:971536df1001191350x3bd5d982j9879e05453760...@mail.gmail.com...

Using data frame, a, from the post below, this is how it would be done in SQL using sqldf. We join together the original table, a, with a table of minimums (computed by the nested select) and then choose only the rows where dt - mindt < 7 (in the where clause).

    library(sqldf)
    sqldf("select var1, var2, var3, dt from a,
        (select var1, var2, var3, min(dt) mindt from a group by var1, var2, var3)
        using(var1, var2, var3) where dt - mindt < 7")

      var1 var2 var3         dt
    1    s    1    2 1999-01-01
    2    c    1    2 2000-02-10
    3    c    1    2 2000-02-13
    4    n    2    1 2000-02-11
    5    n    2    1 2000-02-15

On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham h.wick...@gmail.com wrote:
On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter gunter.ber...@gene.com wrote:

One way to do it:

1. Convert your date column to the Date class using the as.Date() function. This allows you to do the necessary arithmetic on the dates below.

    dt <- as.Date(a[,4], "%d/%m/%Y")

2. Create a factor out of your first three columns whose levels are in the same order as the unique rows. Something like the following should do it:

    fac <- do.call(paste, a[,-4])
    fac <- factor(fac, levels=unique(fac))

This allows you to choose the groups of rows whose dates you wish to compare and maintain their correct order in the data frame.

3. Then use tapply:

    a[unlist(tapply(dt, fac, function(x) x - min(x) < 7)), ]

(unlist is needed to remove the list structure and concatenate the logical indices to obtain the subscripting vector).
Here's the same basic approach with the plyr package:

    a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L),
        .Label = c("c", "n", "s"), class = "factor"),
        var2 = c(1, 1, 1, 2, 2, 2), var3 = c(2, 2, 2, 1, 1, 1),
        dt = structure(c(10592, 10997, 11000, 10998, 11002, 11010), class = "Date")),
        .Names = c("var1", "var2", "var3", "dt"), row.names = c(NA, -6L),
        class = "data.frame")

    library(plyr)
    ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)

Hadley __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
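For completeness, the same selection can also be written in base R with ave(), avoiding the explicit factor construction; an untested sketch, using the data frame a defined above:

    # keep rows whose date is within 7 days of their group's minimum;
    # ave() applies the function within each (var1, var2, var3) group
    a[ave(as.numeric(a$dt), a$var1, a$var2, a$var3,
          FUN = function(x) x - min(x)) < 7, ]

ave() returns a vector as long as the input, so the result of the groupwise computation can be compared to 7 directly and used as a logical row index.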
Re: [R] problem of data manipulation
Sounds like a good idea. Would it be possible to give an example of how to combine plyr with data.table, and why that is better than a data.table-only solution?

hadley wickham h.wick...@gmail.com wrote in message news:f8e6ff051001200624r2175e38xf558dc8fa3fb6...@mail.gmail.com...

Note that in the documentation ?[.data.table, where I say that 'by' is slow, I mean relative to how fast it could be. It seems, in this specific example anyway, and with the code posted so far, to be significantly faster than sqldf and plyr.

Of course the best of both worlds would be to use data.table within plyr to get both speed and a consistent syntax for other types of split-apply-combine tasks.

Hadley -- http://had.co.nz/ __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
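The thread never shows such a combination. Purely as an illustration of the idea, a wrapper along these lines would let data.table do the grouped work behind a plyr-style call; dt_ddply is a hypothetical name, and this is an untested sketch, not an API from either package:

    library(data.table)
    # hypothetical ddply-like helper backed by data.table:
    # 'by' is a character vector of grouping columns; 'f' receives each
    # group's subset (.SD) as a data.table and returns a list/data.table
    dt_ddply <- function(df, by, f) {
        dt <- data.table(df)
        as.data.frame(dt[, f(.SD), by = by])
    }

    # e.g. the earlier selection, written plyr-style but computed by data.table
    dt_ddply(a, c("var1", "var2", "var3"),
             function(d) d[dt - min(dt) < 7])

The design point is separation of concerns: the caller keeps plyr's split-apply-combine shape (data, grouping variables, function), while the grouping and combining are delegated to data.table's by mechanism for speed.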