Re: [R] Confusing behaviour in data.table: unexpectedly changing variable

2013-09-25 Thread Matthew Dowle


Very sorry to hear this bit you.  If you need a copy of names before 
changing them by reference :


oldnames <- copy(names(DT))

This will be documented; it's on the bug list to do so. copy() is 
needed in other circumstances too; see ?copy.
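
For instance, a sketch of the fix applied to the example quoted below :

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
oldnames <- copy(names(DT))   # copy() severs the by-reference link
setnames(DT, LETTERS[1:3])
print(oldnames)
## [1] "x" "y" "v"            # preserved, as intended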


More details here :

http://stackoverflow.com/questions/18662715/colnames-being-dropped-in-data-table-in-r
http://stackoverflow.com/questions/15913417/why-does-data-table-update-namesdt-by-reference-even-if-i-assign-to-another-v

Btw, the r-help posting guide says (last time I looked) you should only 
post to r-help about packages if you have tried the maintainer first but 
didn't hear from them; i.e., r-help isn't for support about packages.


I don't follow r-help, so please continue to cc me if you reply.

Matthew

On 25/09/13 00:47, Jonathan Dushoff wrote:

I got bitten badly when a variable I created for the purpose of
recording an old set of names changed when I didn't think I was going
near it.

I'm not sure if this is a desired behaviour, or documented, or warned
about.  I read the data.table intro and the FAQ, and also ?setnames.

Ben Bolker created a minimal reproducible example:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
names(DT)
## [1] "x" "y" "v"

oldnames <- names(DT)
print(oldnames)
## [1] "x" "y" "v"

setnames(DT, LETTERS[1:3])
print(oldnames)
## [1] "A" "B" "C"



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with R CMD check and the inconsolata font business

2013-03-05 Thread Matthew Dowle


On 11/3/2011 3:30 PM, Brian Diggs wrote:


Well, I figured it out.  Or at least got it working.  I had to run

initexmf --mkmaps

because apparently there was something wrong with my font mappings.  
I
don't know why; I don't know how.  But it works now.  I think 
installing

the font into the Windows Font directory was not necessary.  I'm
including the solution in case anyone else has this problem.


Many thanks Brian Diggs! I just had the same problem and that fixed it.

Matthew

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data.table vs plyr reg output

2012-06-29 Thread Matthew Dowle
Hi Geoff,

Please see this part of the r-help posting guide :

"For questions about functions in standard packages distributed with R
(see the FAQ 'Add-on packages in R'), ask questions on R-help. If the
question relates to a contributed package, e.g., one downloaded from CRAN,
try contacting the package maintainer first. You can also use
find("functionname") and packageDescription("packagename") to find this
information. ONLY send such questions to R-help or R-devel if you get no
reply or need further assistance. This applies to both requests for help
and to bug reports."

Where I've capitalised ONLY since it is bold in the original HTML.  I only
saw your post thanks to Google Alerts.

maintainer("data.table") returns the email address of the datatable-help
list, with the posting guide in mind. However, for questions like this, I'd
suggest the data.table tag on Stack Overflow (which I subscribe to) :

http://stackoverflow.com/questions/tagged/data.table

Btw, I recently presented at LondonR.  Here's a link to the slides :

http://datatable.r-forge.r-project.org/LondonR_2012.pdf

Matthew



--
View this message in context: 
http://r.789695.n4.nabble.com/data-table-vs-plyr-reg-output-tp4634518p4634865.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to convert list of matrix (raster:extract o/p) to data table with additional colums (polygon Id, class)

2012-06-29 Thread Matthew Dowle

AKJ,

Please see this recent answer :

http://r.789695.n4.nabble.com/data-table-vs-plyr-reg-output-tp4634518p4634865.html

Matthew



--
View this message in context: 
http://r.789695.n4.nabble.com/how-to-convert-list-of-matrix-raster-extract-o-p-to-data-table-with-additional-colums-polygon-Id-cla-tp4634579p4634868.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SLOW split() function

2011-10-13 Thread Matthew Dowle
Using Josh's nice example, with data.table's built-in 'by' (optimised
grouping) yields a 6 times speedup (100 seconds down to 15 on
my netbook).

> system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
+ x, data=d[.indx,])) }))
   user  system elapsed 
144.501   0.300 145.525 

> system.time(all.2c <- lapply(si, function(.indx) { minimal.lm(y
+ = d[.indx, y], x = d[.indx, list(int, x)]) }))
   user  system elapsed 
100.819   0.084 101.552 

> system.time(all.2d <- d[,minimal.lm2(y=y, x=cbind(int, x)),by="key"])
   user  system elapsed 
 15.269   0.012  15.323   # 6 times faster

> head(all.2c)
$`1`
        coef        se
x1 0.5152438 0.6277254
x2 0.5621320 0.5754560

$`2`
        coef       se
x1 0.2228235 0.312918
x2 0.3312261 0.261529

$`3`
         coef        se
x1 -0.1972439 0.4674000
x2 -0.1674313 0.4479957

$`4`
          coef        se
x1 -0.13915746 0.2729158
x2 -0.03409833 0.2212416

$`5`
           coef        se
x1  0.007969786 0.2389103
x2 -0.083776526 0.2046823

$`6`
          coef        se
x1 -0.58576454 0.5677619
x2 -0.07249539 0.5009013

> head(all.2d)
     key       coef        V2
[1,]   1  0.5152438 0.6277254
[2,]   1  0.5621320 0.5754560
[3,]   2  0.2228235 0.3129180
[4,]   2  0.3312261 0.2615290
[5,]   3 -0.1972439 0.4674000
[6,]   3 -0.1674313 0.4479957

> minimal.lm2   # slightly modified version of Josh's
function(y, x) {
  obj <- lm.fit(x = x, y = y)
  resvar <- sum(obj$residuals^2)/obj$df.residual
  p <- obj$rank
  R <- .Call("La_chol2inv", x = obj$qr$qr[1L:p, 1L:p, drop = FALSE],
    size = p, PACKAGE = "base")
  m <- min(dim(R))
  d <- c(R)[1L + 0L:(m - 1L) * (dim(R)[1L] + 1L)]
  se <- sqrt(d * resvar)
  list(coef = obj$coefficients, se)
}
 


--
View this message in context: 
http://r.789695.n4.nabble.com/SLOW-split-function-tp3892349p3900851.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to map current Europe?

2011-10-13 Thread Matthew Dowle

Hi Uwe,

When you cc from Nabble it doesn't show as cc'd on r-help. It's
a web form with an "Email this post to..." box. I asked Nabble
support (over a year ago) if they could reflect that in the cc field of
the post they send to r-help, with no luck.

The previous thread is cited automatically in the footer: the "View this
message in context" link.

I'm replying to this one because I happened to use Nabble to
reply in another thread, in the same way, earlier this morning.
If it isn't ok to post from Nabble, I believe there's an option to
prevent posting from Nabble.

To double check, I've sent this reply using Nabble. Did you get
the (unreflected) cc? I placed your email address in the "Email
this post to..." box.

Matthew




--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-map-current-Europe-tp3715709p3900971.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] fast or space-efficient lookup?

2011-10-10 Thread Matthew Dowle
Ivo,

Also, perhaps FAQ 2.14 helps : "Can you explain further why
data.table is inspired by A[B] syntax in base?"

http://datatable.r-forge.r-project.org/datatable-faq.pdf

And, 2.15 and 2.16.
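
As a rough illustration of the analogy (hypothetical data; not from the
FAQ itself) :

A <- matrix(1:9, nrow=3)
B <- c(1, 3)
A[B, ]        # base: B indexes rows of A

library(data.table)
X <- data.table(id=c("a","b","c"), v=1:3, key="id")
Y <- data.table(id=c("a","c"))
X[Y]          # data.table: Y's key values look up (join to) X's keyed rows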

Matthew

Steve Lianoglou mailinglist.honey...@gmail.com wrote in message 
news:CAHA9McPQ4P-a2imjm=szgjfxyx0faw0j79fwq2e87dqkf9j...@mail.gmail.com...
Hi Ivo,

On Mon, Oct 10, 2011 at 10:58 AM, ivo welch ivo.we...@gmail.com wrote:
 hi steve---agreed...but is there any other computer language in which
 an expression in a [ . ] is anything except a tensor index selector?

Sure, it's a type specifier in scala generics:
http://www.scala-lang.org/node/113

Something similar to scale-eez in haskell.

Also, in MATLAB (ugh) it's not even a tensor selector (they use normal
parens there).

But I'm not sure what that has to do w/ the price of tea in China.

With data.table, [ is still tensor-selector like, though. You can
just pass in another data.table to use as the keys to do your
selection through the `i` argument (like selecting rows), which I
guess will likely be your most common use case if you're moving to
data.table (presumably you are trying to take advantage of its
quickness over big-table-like objects).

You can use the `j` param to further manipulate columns. If you pass
in a data.table as `i`, it will add its columns to `j`.

I'll grant you that it is different from your standard rectangular
object selection in R, but the motivation isn't so strange: both
i,j params in normal calls to 'xxx[i,j]' are for selecting (ok, not
manipulating) rows and columns on other rectangular-like objects,
too.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] multicore by(), like mclapply?

2011-10-10 Thread Matthew Dowle
Package plyr has .parallel.

Searching datatable-help for multicore, say on Nabble here,

http://r.789695.n4.nabble.com/datatable-help-f2315188.html

yields three relevant posts and examples.
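
As a rough, untested sketch of the second question (assuming the
'multicore' package of the time, and a hypothetical grouping column grp) :

library(multicore)     # mclapply(); later absorbed into base package 'parallel'
ll <- split(d, d$grp)  # data.frame or data.table to a list, split by a column
res <- mclapply(ll, function(s) coef(lm(y ~ x, data=s)))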

Please check the wiki do's and don'ts to make sure you didn't
fall into one of those traps, though (we don't know your data or task,
so we're just guessing) :

http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table

HTH
Matthew

ivo welch ivo.we...@gmail.com wrote in message 
news:CAPr7RtUroPQtQvoh5uBuT60OYkwGR+ufGr_Z=g5g+vljeoj...@mail.gmail.com...
 dear r experts---Is there a multicore equivalent of by(), just like
 mclapply() is the multicore equivalent of lapply()?

 if not, is there a fast way to convert a data.table into a list based
 on a column that lapply and mclapply can consume?

 advice appreciated...as always.

 regards,

 /iaw
 
 Ivo Welch (ivo.we...@gmail.com)


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Efficient way to do a merge in R

2011-10-04 Thread Matthew Dowle

Joshua Wiley jwiley.ps...@gmail.com wrote in message 
news:canz9z_kopuwkzb-zxr96pvulhhf2znxntxso9xnyho-_jum...@mail.gmail.com...
 On Tue, Oct 4, 2011 at 12:40 AM, Rainer Schuermann
 rainer.schuerm...@gmx.net wrote:
 Any comments are very welcome,

 3. If that fails, and nobody else has a better idea, I would consider 
 using a database engine for the job.

 Not a bad idea for working with large datasets either.

or, the data.table package
http://datatable.r-forge.r-project.org/
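
For instance, a minimal sketch of a keyed join as a merge (hypothetical
data) :

library(data.table)
X <- data.table(id=1:3, a=c(10,20,30), key="id")
Y <- data.table(id=2:3, b=c("p","q"), key="id")
X[Y]    # rows of X matching Y on the key, with Y's columns alongside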

Matthew

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] cannot install.packages(data.table)

2011-10-04 Thread Matthew Dowle
Assuming you can install other packages ok, data.table depends on
R >= 2.12.0. Which version of R do you have?
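
A quick way to check, for instance :

R.version.string            # e.g. "R version 2.11.1 (2010-05-31)"
getRversion() >= "2.12.0"   # TRUE/FALSE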

_If_ that's the problem, does anyone know if anything prevents
R's error message from stating which dependency isn't satisfied? I think
I've seen users confused by this before, for other packages too.

Matthew

Emmanuel Mayssat emays...@gmail.com wrote in message 
news:cacb6zmctdrjkbftqrw+tv2owptrkgwytc_-hvvtguzwu9gq...@mail.gmail.com...
Hello,

I am new at R.
I am trying to see if R can work for me.
I need to do database-like lookups (select * from table where
name=='toto') and work with matrices (transpose, add columns, remove
rows, etc).
It seems that the data.table package can help.
http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table

I installed R and ...

> install.packages("data.table")
Warning in install.packages("data.table") :
  argument 'lib' is missing: using '/usr/local/lib/R/site-library'
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
  package ‘data.table’ is not available

> install.packages()
doesn't show the package.

where can I find it?

--
Emmanuel

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] formatting a 6 million row data set; creating a censoring variable

2011-09-01 Thread Matthew Dowle

This is the fastest data.table way I can think of :

ans = mydt[,list(mytime=.N),by=list(id,mygroup)]  # run length per (id,mygroup)
ans[,censor:=0L]
ans[J(unique(id)), censor:=1L, mult="last"]       # flag each id's last group
      id mygroup mytime censor
[1,]   1       A      1      1
[2,]   2       B      3      0
[3,]   2       C      3      0
[4,]   2       D      6      1
[5,]   3       A      3      0
[6,]   3       B      3      1
[7,]   4       A      1      1

> I'll post the timings on the real data set shortly.
Please do.

Matthew


William Dunlap wdun...@tibco.com wrote in message 
news:e66794e69cfde04d9a70842786030b9304e...@pa-mbx04.na.tibco.com...
 I'll assume that all of an individual's data rows
 are contiguous and that an individual always passes through
 the groups in order (or, least, the individual
 never leaves a group and then reenters it), so we
 can find everything we need to know by comparing each
 row with the previous row.

 You can use rle() to quickly make the time
 column:
   rle(paste(d$mygroup, d$id))$lengths
  [1] 1 3 3 6 3 3 1

 For the censor column it is probably easiest to consider
 what rle() must do internally and use a modification of that.
 E.g.,
 isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
 isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
 outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
 output <- d[outputRows, ]
 output$mytime <- diff(c(0, which(outputRows)))
 output$censor <- as.integer(isLastInRun(d$id))
 which gives you
   output
    gender mygroup id mytime censor
 1       F       A  1      1      1
 4       F       B  2      3      0
 7       F       C  2      3      0
 13      F       D  2      6      1
 16      M       A  3      3      0
 19      M       B  3      3      1
 20      M       A  4      1      1
 You showed a rearrangement of the columns
   output[, c("id", "mygroup", "mytime", "censor")]
    id mygroup mytime censor
 1   1       A      1      1
 4   2       B      3      0
 7   2       C      3      0
 13  2       D      6      1
 16  3       A      3      0
 19  3       B      3      1
 20  4       A      1      1
 This ought to be quicker than plyr, but data.table
 may do similar run-oriented operations.

 Bill Dunlap
 Spotfire, TIBCO Software
 wdunlap tibco.com

 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] 
 On Behalf Of Juliet Hannah
 Sent: Wednesday, August 31, 2011 10:51 AM
 To: r-help@r-project.org
 Subject: [R] formatting a 6 million row data set; creating a censoring 
 variable

 List,

 Consider the following data.

gender mygroup id
 1   F   A  1
 2   F   B  2
 3   F   B  2
 4   F   B  2
 5   F   C  2
 6   F   C  2
 7   F   C  2
 8   F   D  2
 9   F   D  2
 10  F   D  2
 11  F   D  2
 12  F   D  2
 13  F   D  2
 14  M   A  3
 15  M   A  3
 16  M   A  3
 17  M   B  3
 18  M   B  3
 19  M   B  3
 20  M   A  4

 Here is the reshaping I am seeking (explanation below).

       id mygroup mytime censor
 [1,]   1       A      1      1
 [2,]   2       B      3      0
 [3,]   2       C      3      0
 [4,]   2       D      6      1
 [5,]   3       A      3      0
 [6,]   3       B      3      1
 [7,]   4       A      1      1

 I need to create 2 variables. The first one is a time variable.
 Observe that for id=2, the variable mygroup=B was observed 3 times. In
 the solution we see in row 2 that id=2 has a mytime variable of 3.

 Next, I need to create a censoring variable.

 Notice id=2 goes through values of B, C, D for mygroup. This means
 the change from B to C and C to D is observed.  There is no change
 from D. I need to indicate this with a 'censoring' variable. So B and
 C would have values 0, and D would have a value of 1. As another
 example, id=1 never changes, so I assign it censor=1. Overall, if a
 change is observed, 0 should be assigned, and if a change is not
 observed 1 should be assigned.

 One potential challenge is that the original data set has over 5
 million rows. I have ideas, but I'm still getting used to the
 data.table and plyr syntax.  I also seek a base R solution. I'll post
 the timings on the real data set shortly.

 Thanks for your help.

  sessionInfo()
 R version 2.13.1 (2011-07-08)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 # Here is a simplified data set

 myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
 "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
 ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
 "D", "D", "D", 

Re: [R] ddply from plyr package - any alternatives?

2011-08-30 Thread Matthew Dowle
Adam,

> because I did not have time to entirely test

Do you (or does your company) have an automated test suite in place?

R 2.10.0 is nearly two years old,  and R 2.12.0 is nearly one.

Matthew

AdamMarczak adam.marc...@gmail.com wrote in message 
news:1314385041626-3771731.p...@n4.nabble.com...
 No, it's not much faster. I'd say it's faster by about 10-15% in my case.

 I don't want either the plyr or data.table package, because our software on
 the server does not support R versions over 2.10 and both of them have a
 dependency on R >= 2.12. Also I do not want to use old archives, because I
 did not have time to entirely test them, as it was a quick demand for a
 workaround.

 Best regards,
 Adam.

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/ddply-from-plyr-package-any-alternatives-tp3765936p3771731.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Sequential Naming of ggplot .pngs using plyr

2011-08-11 Thread Matthew Dowle

Hi Justin,

In data.table 1.6.1 there was this news item :

o   j's environment is now consistently reused so
    that local variables may be set which persist
    from group to group; e.g., incrementing a group
    counter :
    DT[,list(z,groupInd<-groupInd+1),by="x"]

One of the reasons data.table is fast is that there is no function
run per group. It's just that j expression. That's run in the same
persistent environment for each group, so you can do things
like increment a group counter within it.
If your data were in 'long' format (data.table prefers long format,
like a database) it might be something like (the ggplot line is untested) :

ctr = 1
DT[,{
  png(file=paste('/tmp/plot_number_',ctr,'.png',sep=''),
      height=8.5,width=11,units='in',pointsize=9,res=300)
  print(ggplot(aes(x=site,y=val))+geom_boxplot()+
        opts(title=paste('plot number',ctr,sep=' ')))
  dev.off()
  ctr <- ctr+1 },
  by="site"]

Btw, there was a new feature in 1.6.3, where you can subassign
into a data.table 500 times faster than with <-.  See the NEWS from
1.6.3 for an example :

http://datatable.r-forge.r-project.org/

Matthew


Justin Haynes jto...@gmail.com wrote in message 
news:CAFaj53kjqy=1bJy+iLjeeLYKgvx=rte2h_ha24pt20wqvch...@mail.gmail.com...
 Thanks Ista,

 In my real code that is exactly what I'm doing, but I want to prepend the
 names with a sequential number for easier reference once the pngs are 
 made.

 My initial thought was to add the sequential number to the data before
 sending it to plyr and drawing it out there, but that seems like an
 excessive extra step when I have 1e6 - 1e7 rows.


 Justin


 On Wed, Aug 10, 2011 at 2:42 PM, Ista Zahn 
 iz...@psych.rochester.eduwrote:

 Hi Justin,

 On Wed, Aug 10, 2011 at 5:04 PM, Justin Haynes jto...@gmail.com wrote:
  If I have data:
 
 
 dat <- data.frame(a=rnorm(20),b=rnorm(20),c=rnorm(20),d=rnorm(20),site=rep(letters[5:8],each=5))
 
  And want to plot like this:
 
  ctr <- 1
  for(i in c('a','b','c','d')){
 png(file=paste('/tmp/plot_number_',ctr,'.png',sep=''),height=8.5,
  width=11,units='in',pointsize=9,res=300)
 print(ggplot(dat[,names(dat) %in%
 
 c('site',i)],aes(x=factor(site),y=dat[,i]))+geom_boxplot()+opts(title=paste('plot
  number',ctr,sep=' ')))
 dev.off()
 ctr <- ctr+1
  }
 
  Is there a way to do the same naming using plyr (or data.table or 
  foreach
  which I am not familiar with at all!)?

 This is not the same naming, but the same general idea can be
 achieved with plyr using

  d_ply(melt(dat,id.vars='site'),.(variable),function(df) {
 png(file=paste("plyr_plot", unique(df$variable),
 ".png"),height=8.5,width=11,units='in',pointsize=9,res=300)
 print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot())
 dev.off()
  })

 I'm not up to speed on .parallel, foreach etc., so I'l leave the rest
 to someone else.

 Best,
 Ista
 
  m.dat <- melt(dat,id.vars='site')
  ddply(m.dat,.(variable),function(df)
  print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot()+ ..?)
 
  And better yet, is there a way to do it using .parallel=T?
 
  Faceting is not really an option (unless I can facet onto multiple 
  pages
 of
  a pdf or something) because these need to go into reports as 
  individually
  labelled and titled plots.
 
 
  As a bit of a corollary, is it really worth the headache to resolve 
  this
 if
  I am only using melt/plyr to split on the four letter variables? With a
  larger set of data (1e6 rows), the melt/plyr version takes a 
  significant
  amount of time but .parallel=T drops the time significantly.  Is the
 right
  answer a foreach loop and can I do that with the increasing counter? (I
  haven't gotten beyond Hadley's .parallel feature in my parallel R
  dealings.)
 
 
 
 dat <- data.frame(a=rnorm(1e6),b=rnorm(1e6),c=rnorm(1e6),d=rnorm(1e6),site=rep(letters[5:8],each=2.5e5))
  ctr <- 1
  system.time(for(i in c('a','b','c','d')){
  + png(file=paste('/tmp/plot_number_',ctr,'.png',sep=''),height=8.5,
  width=11,units='in',pointsize=9,res=300)
  + print(ggplot(dat[,names(dat) %in%
 
 c('site',i)],aes(x=factor(site),y=dat[,i]))+geom_boxplot()+opts(title=paste('plot
  number',ctr,sep=' ')))
  + dev.off()
  + ctr <- ctr+1
  + })
user  system elapsed
   54.630   0.120  54.843
 
  system.time(
  + ddply(melt(dat,id.vars='site'),.(variable),function(df) {
  +
 
 png(file='/tmp/plyr_plot.png',height=8.5,width=11,units='in',pointsize=9,res=300)
  + print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot())
  + dev.off()
  + },.parallel=F)
  + )
user  system elapsed
   58.400.13   58.63
 
  system.time(
  + ddply(melt(dat,id.vars='site'),.(variable),function(df) {
  +
 
 png(file='/tmp/plyr_plot.png',height=8.5,width=11,units='in',pointsize=9,res=300)
  + print(ggplot(df,aes(x=factor(site),y=value))+geom_boxplot())
  + dev.off()
  + },.parallel=T)
  + )
user  system elapsed
   70.333.46   27.61
 
 
  How might I speed this up and include the sequential 

Re: [R] EXTERNAL: Re: subset with aggregate key

2011-07-13 Thread Matthew Dowle
To close this thread on-list :
packageVersion() was added to R in 2.12.0.
data.table's dependency on 2.12.0 is updated, thanks.
Matthew

Jesse Brown jesse.r.br...@lmco.com wrote in message 
news:4e1b21a8.8090...@atl.lmco.com...
 Matthew Dowle wrote:
 Hi,

 Try package 'data.table'. It has a concept of keys which allows you to do
 exactly that.

 http://datatable.r-forge.r-project.org/

 Matthew


 Hi Matthew,

 Unfortunately, the load of that library fails (it builds successfully).
 I'm currently looking into why. Error output looks something similar to:


  library(data.table)
 Error in .makeMessage(..., domain = domain, appendLF = appendLF) :
  could not find function packageVersion
 Error : .onAttach failed in 'attachNamespace'
 Error: package/namespace load failed for 'data.table'


 I googled around a bit and there is mention of a bug in packageVersion but 
 there was no solution that I found. Is this something that is easily 
 overcome?


 Thanks,

 Jesse


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] manipulating by lists and ave() functions

2011-07-11 Thread Matthew Dowle

Users of package 'unknownR' already know simplify2array was added in R 
2.13.0.

They also know what else was added.  Do you?

http://unknownr.r-forge.r-project.org/


Joshua Wiley jwiley.ps...@gmail.com wrote in message 
news:canz9z_j+trwoim3scayuaruors+8hyc30pmt_thiex6qmto...@mail.gmail.com...
 On Sat, Jul 9, 2011 at 7:32 AM, David Winsemius dwinsem...@comcast.net 
 wrote:

 On Jul 9, 2011, at 9:44 AM, Berry Boessenkool wrote:


 Maybe I'm missing something, but in what package do I find that 
 function?

 simplify2array(b)

 Fehler: konnte Funktion simplify2array nicht finden
 # Function wasn't found

 help.search("simplify2array")

 No help files found with alias or concept or title matching
 'simplify2array' using fuzzy matching.


 Perhaps it's new, since ?simplify2array brings up a help page and it's in
 base. Try updating.

 Yes, simplify2array() was added in R 2.13.0 to support the simplify =
 array argument to sapply().

 Josh

 [snip]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Simple order() data frame question.

2011-05-12 Thread Matthew Dowle

With data.table, the following is routine :

DT[order(a)]   # ascending
DT[order(-a)]  # descending, if a is numeric
DT[a>5,sum(z),by="c"][order(-V1)]   # sum of z grouped by c, just where a>5, 
then show me the largest first
DT[order(-a,b)]  # order by a descending then by b ascending, if a and b are 
both numeric

It avoids peppering your code with $, and becomes quite natural after a 
short while, especially for compound queries such as the 3rd example.

Matthew

http://datatable.r-forge.r-project.org/


Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message 
news:4dcbec8b.6040...@uni-hamburg.de...
I was wondering whether it would be possible to make a method for
data.frame with sort().
I think it would be more intuitive than using the complex construction
of df[order(df$a),]
Is there any reason not to make it?

Ivan

Le 5/12/2011 15:40, Marc Schwartz a écrit :
 On May 12, 2011, at 8:09 AM, John Kane wrote:

 Argh.  I knew it was at least partly obvious.  I never have been able to 
 read the order() help page and understand what it is saying.

 THanks very much.

 By the way, to me it is counter-intuitive that the command is

 df1[order(df1[,2],decreasing=TRUE),]
 For some reason I keep expecting it to be
 order( , df1[,2],decreasing=TRUE)

 So clearly I don't understand what is going on, but at least I am a lot 
 better off.  I may be able to get this graph to work.

 John,

 Perhaps it may be helpful to understand that order() does not actually 
 sort() the data.

 It returns a vector of indices into the data, where those indices are the 
 sorted ordering of the elements in the vector, or in this case, the 
 column.

 So you want the output of order() to be used within the brackets for the 
 row *indices*, to reflect the ordering of the column (or columns in the 
 case of a multi-level sort) that you wish to use to sort the data frame 
 rows.

 set.seed(1)
 x <- sample(10)

 x
   [1]  3  4  5  7  2  8  9  6 10  1


 # sort() actually returns the sorted data
 sort(x)
   [1]  1  2  3  4  5  6  7  8  9 10


 # order() returns the indices of 'x' in sorted order
 order(x)
   [1] 10  5  1  2  3  8  4  6  7  9


 # This does the same thing as sort()
 x[order(x)]
   [1]  1  2  3  4  5  6  7  8  9 10


 set.seed(1)
 df1 <- data.frame(aa = letters[1:10], bb = rnorm(10))

 df1
 aa bb
 1   a -0.6264538
 2   b  0.1836433
 3   c -0.8356286
 4   d  1.5952808
 5   e  0.3295078
 6   f -0.8204684
 7   g  0.4874291
 8   h  0.7383247
 9   i  0.5757814
 10  j -0.3053884


 # These are the indices of df1$bb in sorted order
 order(df1$bb)
   [1]  3  6  1 10  2  5  7  9  8  4


 # Get df1$bb in increasing order
 df1$bb[order(df1$bb)]
   [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884  0.1836433  0.3295078
   [7]  0.4874291  0.5757814  0.7383247  1.5952808


 # Same thing as above
 sort(df1$bb)
   [1] -0.8356286 -0.8204684 -0.6264538 -0.3053884  0.1836433  0.3295078
   [7]  0.4874291  0.5757814  0.7383247  1.5952808


 You can't use the output of sort() to sort the data frame rows, so you 
 need to use order() to get the ordered indices and then use that to 
 extract the data frame rows in the sort order that you desire:

 df1[order(df1$bb), ]
 aa bb
 3   c -0.8356286
 6   f -0.8204684
 1   a -0.6264538
 10  j -0.3053884
 2   b  0.1836433
 5   e  0.3295078
 7   g  0.4874291
 9   i  0.5757814
 8   h  0.7383247
 4   d  1.5952808


 df1[order(df1$bb, decreasing = TRUE), ]
 aa bb
 4   d  1.5952808
 8   h  0.7383247
 9   i  0.5757814
 7   g  0.4874291
 5   e  0.3295078
 2   b  0.1836433
 10  j -0.3053884
 1   a -0.6264538
 6   f -0.8204684
 3   c -0.8356286


 Does that help?

 Regards,

 Marc Schwartz

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


-- 
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calan...@uni-hamburg.de

**
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/1525_8_1.php

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] unknownR : you didn't know you didn't know?

2011-04-28 Thread Matthew Dowle

Do you know how many functions there are in base R?
How many of them do you know you don't know?
Run unk() to discover your unknown unknowns.
It's fast and it's fun!

unknownR v0.2 is now on CRAN.

More information is on the homepage :

http://unknownr.r-forge.r-project.org/

Or, just install the package and try it :

install.packages("unknownR")
library(unknownR)
?unk
unk()
learn()

Matthew

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] data.table 1.6 is now on CRAN

2011-04-28 Thread Matthew Dowle

data.table offers fast subset, fast grouping and fast ordered joins in a
short and flexible syntax, for faster development. It was first released
in August 2008 and is now the 3rd most popular package on Crantastic
with 20 votes and 7 reviews.

* X[Y] is a fast join for large data.
* X[,sum(b*c),by=a] is fast aggregation.
* 10+ times faster than tapply()
* 100+ times faster than ==

It inherits from data.frame. It is compatible with packages that only
accept data.frame.
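
A flavour of the syntax (hypothetical data) :

library(data.table)
X = data.table(a=c("x","x","y"), b=1:3, c=4:6, key="a")
Y = data.table(a="x")
X[Y]                  # fast join on the key
X[,sum(b*c),by="a"]   # fast aggregation by group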

This is a major release that adds S4 compatibility to the package,
contributed by Steve Lianoglou.

Recently the FAQs have been revised and ?data.table has been simplified
with shorter and easier examples. There is a wiki (with content), three
vignettes, a video, a NEWS file and an active user community.

http://datatable.r-forge.r-project.org/

http://unknownr.r-forge.r-project.org/toppkgs.html


Matthew, Tom and Steve

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R licence

2011-04-07 Thread Matthew Dowle
Peter,

If the proprietary part of REvolution's product is ok, then surely 
Stanislav's suggestion is too. No?

Matthew


peter dalgaard pda...@gmail.com wrote in message 
news:be157cf5-9b4b-45a0-a7d4-363b774f1...@gmail.com...

 On Apr 7, 2011, at 09:45 , Stanislav Bek wrote:

 Hi,

 is it possible to use some statistical computing by R in proprietary 
 software?
 Our software is written in c#, and we intend to use
 http://rdotnet.codeplex.com/
 to get R work there. Especially we want to use loess function.

 You need to take legal advice to be certain, but offhand I would say that 
 this kind of circumvention of the GPL is _not_ allowed.

 It all depends on whether the end product is a derivative work, in which 
 case, the whole must be distributed under a GPL-compatible licence. The 
 situation around GPL-incompatible plug-ins or plug-ins interfacing to R in 
 GPL -incompatible software is legally murky, but using R as a subroutine 
 library for proprietary code is clearly crossing the line, as far as I can 
 tell.

 -- 
 Peter Dalgaard
 Center for Statistics, Copenhagen Business School
 Solbjerg Plads 3, 2000 Frederiksberg, Denmark
 Phone: (+45)38153501
 Email: pd@cbs.dk  Priv: pda...@gmail.com


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R licence

2011-04-07 Thread Matthew Dowle
Duncan,

Letting you know then that I just don't see how the first paragraph here :

http://www.revolutionanalytics.com/downloads/gpl-sources.php

is compatible with clause 2(b) here :

http://www.gnu.org/licenses/gpl-2.0.html

Perhaps somebody could explain why it is?

Matthew


Duncan Murdoch murdoch.dun...@gmail.com wrote in message 
news:4d9da9ff.9020...@gmail.com...
 On 07/04/2011 7:47 AM, Matthew Dowle wrote:
 Peter,

 If the proprietary part of REvolution's product is ok, then surely
 Stanislav's suggestion is too. No?

 Revolution has said that they believe they follow the GPL, and they 
 haven't been challenged on that.   If you think that they don't, you could 
 let an R copyright holder know what they're doing that's a license 
 violation.

 My opinion of Stanislav's question is that he doesn't give enough 
 information to answer.  If he is planning to distribute R as part of his 
 product, he needs to follow the GPL.  If not, I don't think any R 
 copyright holder has anything to complain about.

 Duncan Murdoch

 Matthew


 peter dalgaardpda...@gmail.com  wrote in message
 news:be157cf5-9b4b-45a0-a7d4-363b774f1...@gmail.com...
 
   On Apr 7, 2011, at 09:45 , Stanislav Bek wrote:
 
   Hi,
 
   is it possible to use some statistic computing by R in proprietary
   software?
   Our software is written in c#, and we intend to use
   http://rdotnet.codeplex.com/
   to get R work there. Especially we want to use loess function.
 
   You need to take legal advice to be certain, but offhand I would say 
  that
   this kind of circumvention of the GPL is _not_ allowed.
 
   It all depends on whether the end product is a derivative work, in 
  which
   case, the whole must be distributed under a GPL-compatible licence. 
  The
   situation around GPL-incompatible plug-ins or plug-ins interfacing to 
  R in
   GPL -incompatible software is legally murky, but using R as a 
  subroutine
   library for proprietary code is clearly crossing the line, as far as I 
  can
   tell.
 
   -- 
   Peter Dalgaard
   Center for Statistics, Copenhagen Business School
   Solbjerg Plads 3, 2000 Frederiksberg, Denmark
   Phone: (+45)38153501
   Email: pd@cbs.dk  Priv: pda...@gmail.com
 

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] General binary search?

2011-04-05 Thread Matthew Dowle
Try data.table:::sortedmatch, which is implemented in C.
It requires its input to be sorted (and doesn't check).

Stavros Macrakis macra...@alum.mit.edu wrote in message 
news:BANLkTi=j2lf5syxytv1dd4k9wr0zgk8...@mail.gmail.com...
 Is there a generic binary search routine in a standard library which

   a) works for character vectors
   b) runs in O(log(N)) time?

 I'm aware of findInterval(x,vec), but it is restricted to numeric vectors.

 I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and
 fastmatch), but I need the greatest-lower-bound match in my application.

 findInterval is also slow for large N=length(vec) because of the O(N)
 checking it does, as Duncan Murdoch has pointed out
 (https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html):
 though its documentation says it runs in O(n * log(N)), it actually runs
 in O(n * log(N) + N), which is quite noticeable for largish N.  But that
 is easy enough to work around by writing a variant of findInterval which
 calls find_interv_vec without checking.

-s

 PS Yes, binary search is a one-liner in R, but I always prefer to use
 standard, fast native libraries when possible

 binarysearch <- function(val,tab,L,H) {while (H >= L) { M <- L + (H-L) %/% 2; if
 (tab[M] > val) H <- M-1 else if (tab[M] < val) L <- M+1 else return(M)};
 return(L-1)}
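
 For instance, on a sorted character vector :

 binarysearch("m", letters, 1, length(letters))
 # [1] 13    (letters[13] == "m")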

 [[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to calculate means for multiple variables in samples with different sizes

2011-03-11 Thread Matthew Dowle
Hi,

One liners in data.table are :

> x.dt[,lapply(.SD,mean),by="sample"]
     sample replicate   height    weight      age
[1,]      A       2.0 12.20000 0.5033333 6.000000
[2,]      B       1.5 12.75000 0.7150000 4.500000
[3,]      C       2.5 11.35250 0.5125000 3.750000
[4,]      D       2.0 14.99333 0.6733333 5.333333

without the replicate column :

> x.dt[,lapply(list(height,weight,age),mean),by="sample"]
     sample       V1        V2       V3
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333

one (long) way to retain the column names :

> x.dt[,lapply(list(height=height,weight=weight,age=age),mean),by="sample"]
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333


or this is shorter :

> ans = x.dt[,lapply(.SD,mean),by="sample"]
> ans$replicate = NULL
> ans
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333


or another way :

> mycols = c("height","weight","age")
> x.dt[,lapply(.SD[,mycols,with=FALSE],mean),by="sample"]
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333


or another way :

> x.dt[,lapply(.SD[,list(height,weight,age)],mean),by="sample"]
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333


The way Jim showed :

> x.dt[, list(height = mean(height)
+, weight = mean(weight)
+, age = mean(age)
+), by = "sample"]

is the more flexible syntax, for when you want different functions on 
different columns, and as a bonus it is fast.

Matthew


Dennis Murphy djmu...@gmail.com wrote in message 
news:AANLkTimxXL8BqTaYKUb=saee2cra9fosfuap4qzkx...@mail.gmail.com...
 Hi:

 Here are a few one-liners. Calling your data frame dd,

 aggregate(cbind(height, weight, age) ~ sample, data = dd, FUN = mean)
   sample   height    weight      age
 1      A 12.20000 0.5033333 6.000000
 2      B 12.75000 0.7150000 4.500000
 3      C 11.35250 0.5125000 3.750000
 4      D 14.99333 0.6733333 5.333333

 With package doBy:

 library(doBy)
 summaryBy(height + weight + age ~ sample, data = dd, FUN = mean)
   sample height.mean weight.mean age.mean
 1      A    12.20000   0.5033333 6.000000
 2      B    12.75000   0.7150000 4.500000
 3      C    11.35250   0.5125000 3.750000
 4      D    14.99333   0.6733333 5.333333

 With package plyr:

 library(plyr)
 ddply(dd, .(sample), colwise(mean, .(height, weight, age)))
   sample   height    weight      age
 1      A 12.20000 0.5033333 6.000000
 2      B 12.75000 0.7150000 4.500000
 3      C 11.35250 0.5125000 3.750000
 4      D 14.99333 0.6733333 5.333333

 Dennis

 On Fri, Mar 11, 2011 at 1:32 AM, Aline Santos aline...@gmail.com wrote:

 Hello R-helpers:

 I have data like this:

 sample  replicate  height  weight  age
 A       1.00       12.0    0.64    6.00
 A       2.00       12.2    0.38    6.00
 A       3.00       12.4    0.49    6.00
 B       1.00       12.7    0.65    4.00
 B       2.00       12.8    0.78    5.00
 C       1.00       11.9    0.45    6.00
 C       2.00       11.84   0.44    2.00
 C       3.00       11.43   0.32    3.00
 C       4.00       10.24   0.84    4.00
 D       1.00       14.2    0.54    2.00
 D       2.00       15.67   0.67    7.00
 D       3.00       15.11   0.81    7.00

 Now, how can I calculate the mean for each condition (height, weight,
 age) in each sample, considering the samples have different numbers of
 replicates?


 The final matrix should look like:

 sample  height  weight  age
 A       12.20   0.50    6.00
 B       12.75   0.72    4.50
 C       11.35   0.51    3.75
 D       14.99   0.67    5.33

 This is a simplified version of my dataset, which consist of 100 samples
 (unequally distributed in 530 replicates) for 600 different conditions.

 I appreciate all the help.

 A.S.

[[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 [[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transforming relational data

2011-02-22 Thread Matthew Dowle

With the new example, what is the full output, and
what do you need instead? Was it correct for the
previous example?

Matthew

mathijsdevaan mathijsdev...@gmail.com wrote in message 
news:1298372018181-3318939.p...@n4.nabble.com...

 Hi Matthew, thanks for your help. There are some things going wrong still.
 Consider this (slightly extended) example:

 library(data.table)
 DT = data.table(read.table(textConnection("A  B  C
 1 1  a  1999
 2 1  b  1999
 3 1  c  1999
 4 1  d  1999
 5 2  c  2001
 6 2  d  2001
 7 3  a  2004
 8 3  b  2004
 9 3  d  2004
 10 4  c  2001
 11 4  d  2001"),head=TRUE,stringsAsFactors=FALSE))
 firststep = DT[,cbind(A,expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
 firststep
        C A Var1 Var2     v
 1   1999 1    b    a 0.250
 2   1999 1    c    a 0.250
 3   1999 1    d    a 0.250
 4   1999 1    a    b 0.250
 5   1999 1    c    b 0.250
 6   1999 1    d    b 0.250
 7   1999 1    a    c 0.250
 8   1999 1    b    c 0.250
 9   1999 1    d    c 0.250
 10  1999 1    a    d 0.250
 11  1999 1    b    d 0.250
 12  1999 1    c    d 0.250
 13  2001 2    b    a 0.250
 14  2001 4    b    a 0.250
 15  2001 2    a    b 0.250
 16  2001 4    a    b 0.250
 17  2001 2    b    a 0.250
 18  2001 4    b    a 0.250
 19  2001 2    a    b 0.250
 20  2001 4    a    b 0.250
 21  2004 3    b    a 0.333
 22  2004 3    c    a 0.333
 23  2004 3    a    b 0.333
 24  2004 3    c    b 0.333
 25  2004 3    a    c 0.333
 26  2004 3    b    c 0.333

 Following firststep, projects 2 and 4 involved individuals a and b, while
 actually c and d were involved. It seems that something is going wrong
 in transforming the data.

 Then going to the final result, a list is generated of years and sums of 
 v,
 rather than a list of projects and sums of v. Probably I haven't been 
 clear
 enough: I want to produce a list of all projects and the familiarity of 
 all
 project members involved right before the start of the project.

 Example
 project_id  familiarity
 4  0.25

 Members c and d were jointly involved in 3 projects: 1, 2 and 4. Project 4
 took place in 2001, so only project 1 took place before that (in 1999;
 project 2 took place in the same year and is therefore not included). The
 average familiarity between the members in project 1 was 1/4, so:

 project_id  familiarity
 4  0.25

 Thanks!


 Matthew Dowle wrote:


 Thanks for the attempt and required output. How about this?

  firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
  setkey(firststep,Var1,Var2,C)
  firststep = firststep[,transform(.SD,cv=cumsum(v)),by=list(Var1,Var2)]
  setkey(firststep,Var1,Var2,C)
  DT[, {x=data.table(expand.grid(B,B),C[1]-1L)
    firststep[x,roll=TRUE,nomatch=0][,sum(cv)]   # prior familiarity
   },by="C"]
          C  V1
  [1,] 1999 0.0
  [2,] 2001 0.5
  [3,] 2004 2.5

 I think you may have said you have large data. If so, this
 method should be fast. Please let us know how you get on.

 HTH
 Matthew



 On Thu, 17 Feb 2011 23:07:19 -0800, mathijsdevaan wrote:

 OK, for the last step I have tried this (among other things):
 library(data.table)
 DT = data.table(read.table(textConnection("A  B  C
 1 1  a  1999
 2 1  b  1999
 3 1  c  1999
 4 1  d  1999
 5 2  c  2001
 6 2  d  2001
 7 3  a  2004
 8 3  b  2004
 9 3  d  2004"),head=TRUE,stringsAsFactors=FALSE))

 firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
 setkey(firststep,Var1,Var2)
 list1 <- firststep[J(expand.grid(DT$B,DT$B),v=1/length(DT$B)),nomatch=0][,sum(v)]
 list1
 # 27

 What I would like to get:
 list
 1  0
 2  0.5
 3  2.5

 Thanks!

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



 -- 
 View this message in context: 
 http://r.789695.n4.nabble.com/Re-Transforming-relational-data-tp3307449p3318939.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transforming relational data

2011-02-22 Thread Matthew Dowle

Thanks. How about this?

DT$B = factor(DT$B)
firststep = DT[,cbind(expand.grid(B,B),v=1/length(B),C=C[1]),by="A"][Var1!=Var2]
setkey(firststep,Var1,Var2,C)
firststep = firststep[,transform(.SD,cv=cumsum(v)),by=list(Var1,Var2)]
setkey(firststep,Var1,Var2,C)
DT[, {x=data.table(expand.grid(B,B),C[1]-1L)
  firststep[x,roll=TRUE,nomatch=0][,sum(cv)]   # prior familiarity
 },by="A"]
     A  V1
[1,] 1 0.0
[2,] 2 0.5
[3,] 3 1.5
[4,] 4 0.5


On Tue, 22 Feb 2011 05:02:05 -0800, mathijsdevaan wrote:

 The output for the new example should be:
 
 project  v
 1  0
 2  0.5
 3  1.5
 4  0.5
 
 The output you calculated was correct for the v per year, but the v per
 group would be incorrect. I think the problem lies in the fact that
 expand.grid(B,B) doesn't take into account that combinations of B can
 only be formed within A. Thanks again!

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transforming relational data

2011-02-21 Thread Matthew Dowle

Thanks for the attempt and required output. How about this?

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2,C)
firststep = firststep[,transform(.SD,cv=cumsum(v)),by=list(Var1,Var2)]
setkey(firststep,Var1,Var2,C)
DT[, {x=data.table(expand.grid(B,B),C[1]-1L)
  firststep[x,roll=TRUE,nomatch=0][,sum(cv)]   # prior familiarity
 },by="C"]
        C  V1
[1,] 1999 0.0
[2,] 2001 0.5
[3,] 2004 2.5

I think you may have said you have large data. If so, this
method should be fast. Please let us know how you get on.

HTH
Matthew



On Thu, 17 Feb 2011 23:07:19 -0800, mathijsdevaan wrote:

 OK, for the last step I have tried this (among other things):
 library(data.table)
 DT = data.table(read.table(textConnection("A  B  C
 1 1  a  1999
 2 1  b  1999
 3 1  c  1999
 4 1  d  1999
 5 2  c  2001
 6 2  d  2001
 7 3  a  2004
 8 3  b  2004
 9 3  d  2004"),head=TRUE,stringsAsFactors=FALSE))

 firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
 setkey(firststep,Var1,Var2)
 list1 <- firststep[J(expand.grid(DT$B,DT$B),v=1/length(DT$B)),nomatch=0][,sum(v)]
 list1
 # 27
 
 What I would like to get:
 list
 1  0
 2  0.5
 3  2.5
 
 Thanks!

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transforming relational data

2011-02-17 Thread Matthew Dowle

Mathijs,

To my eyes you seem to have repeated back what is already done.

More R and less English would help. In other words, if it is not 2.5
you need, what is it? Please provide some input and state what the
output should be (and what you tried already).

Matthew

-- 
View this message in context: 
http://r.789695.n4.nabble.com/Re-Transforming-relational-data-tp3307449p3311954.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] boot.ci error with large data sets

2011-02-16 Thread Matthew Dowle

Hello Lars, (cc'd)

Did you ask maintainer(boot) first, as requested by the posting guide?

If you did, but didn't hear back, then please say so, so that we know
you did follow the guide. That maintainer is particularly active, and
particularly efficient though, so I doubt you didn't hear back.

We can tell it's your first post to r-help, and we can tell you have at
least read the posting guide and done very well in following almost all
of it. I can't see anything else wrong with your post (and the subject
line is good) ... other than where you sent it :-)

Matthew


Lars Dalby lars.da...@gmail.com wrote in message 
news:fef4d63e-90f6-43aa-90a6-872792faa...@s11g2000yqc.googlegroups.com...
 Dear List

 I have run into some problems with boot.ci from package boot. When I
 try to obtain a confidence interval of type bca, boot.ci() returns the
 following error when the data set i large:
 Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 =
 t0.o,  :
  estimated adjustment 'a' is NA

 Below is an example that produces the above mentioned error on my
 machine.

 library(boot)
 #The wrapper function:
w.mean <- function(x, d) {
E <- x[d,]
return(weighted.mean(E$A, E$B))}
#Some fake data:
test <- data.frame(rnorm(1000, 5), rnorm(1000, 3))
test1 <- data.frame(rnorm(100000, 5), rnorm(100000, 3))
names(test) <- c("A", "B")
names(test1) <- c("A", "B")
# Getting the boot object and the CI; seems to work fine
bootout <- boot(test, w.mean, R=1000, stype="i")
(bootci <- boot.ci(bootout, conf = 0.95, type = "bca"))
# Now with a bigger data set, boot.ci returns an error.
bootout1 <- boot(test1, w.mean, R=1000, stype="i")
(bootci1 <- boot.ci(bootout1, conf = 0.95, type = "bca"))

 Does anyone have an idea as to why this happens? (Session info below)

 Best,
 Lars

 sessionInfo()
 R version 2.12.1 (2010-12-16)
 Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

 locale:
 [1] da_DK.UTF-8/da_DK.UTF-8/C/C/da_DK.UTF-8/da_DK.UTF-8

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods
 base

 other attached packages:
 [1] boot_1.2-43

 loaded via a namespace (and not attached):
 [1] tools_2.12.1


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transforming relational data

2011-02-15 Thread Matthew Dowle

Hello. One (of many) solution might be:

require(data.table)
DT = data.table(read.table(textConnection("A  B  C
1 1  a  1999
2 1  b  1999
3 1  c  1999
4 1  d  1999
5 2  c  2001
6 2  d  2001"),head=TRUE,stringsAsFactors=FALSE))

firststep = DT[,cbind(expand.grid(B,B),v=1/length(B)),by="C"][Var1!=Var2]
setkey(firststep,Var1,Var2)
grp3 = c("a","b","d")
firststep[J(expand.grid(grp3,grp3)),nomatch=0][,sum(v)]
# 2.5

If I guess the bigger picture correctly, this can be extended
to make a time series of prior familiarity by including
the year in the key.

If you decide to try this, please make sure to grab the latest
(recent) version of data.table from CRAN (v1.5.3). Suggest that
you run it first to confirm it does return 2.5, then break it
down and run it step by step to see how each part works. You
will need some time to read the vignettes and ?data.table
(which has recently been improved) but I hope you think it is
worth it. Support is available at maintainer("data.table").

HTH
Matthew


On Mon, 14 Feb 2011 09:22:12 -0800, mathijsdevaan wrote:
 Hi,
 
 I have a large dataset with info on individuals (B) that have been
 involved in projects (A) during multiple years (C). The dataset contains
 three columns: A, B, C. Example:

A  B  C
 1 1  a  1999
 2 1  b  1999
 3 1  c  1999
 4 1  d  1999
 5 2  c  2001
 6 2  d  2001
 7 3  a  2004
 8 3  c  2004
 9 3  d  2004
 
 I am interested in how well all the individuals in a project know each
 other. To calculate this team familiarity measure I want to sum the
 familiarity between all individual pairs in a team. The familiarity
 between each individual pair in a team is calculated as the summation of
 each pair's prior co-appearance in a project divided by the total number
 of team members. So the team familiarity in project 3 = (1/4+1/4) +
 (1/4+1/4+1/2) + (1/4+1/4+1/2) = 2.5: a has been in project 1 (of size
 4) with c and d (1/4+1/4); c has been in project 1 (of size 4) with a
 and d (1/4+1/4); and c has been in project 2 (of size 2) with d (1/2).
 
 I think that the best way to do it is to transform the data into an
 edgelist (each pair in one row/two columns) and then creating two
 additional columns for the strength of the familiarity and the year of
 the project in which the pair was active. The problem is that I am stuck
 already in the first step. So the question is: how do I go from the
 current data structure to a list of projects and the familiarity of its
 team members?
 
 Your help is very much appreciated. Thanks!

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Convert the output of by() to a data frame

2011-02-08 Thread Matthew Dowle


There's a much shorter way.  You don't need that ugly h() with all those $
and the potential for bugs!

Using the original f :

dt[,lapply(.SD,f),by=key(dt)]

     grp1 grp2 grp3          a          b          d
        x    x    x   1.000000  81.000000 161.000000
        x    x    x  10.000000  90.000000 170.000000
        x    x    x   5.500000  85.500000 165.500000
        x    x    x   3.027650   3.027650   3.027650
        x    x    x   1.816590  28.239721  54.662851
        x    x    y  11.000000  91.000000 171.000000
        x    x    y  20.000000 100.000000 180.000000
        x    x    y  15.500000  95.500000 175.500000
        x    x    y   3.027650   3.027650   3.027650
        x    x    y   5.119482  31.542612  57.965742
[ snip ]

To get the names included, one (long) way is :

dt[,data.table(sapply(.SD,f),keep.rownames=TRUE),by=key(dt)]

   grp1 grp2 grp3   rn a  b  d
  xxx  min  1.00  81.00 161.00
  xxx  max 10.00  90.00 170.00
  xxx mean  5.50  85.50 165.50
  xxx   sd  3.027650   3.027650   3.027650
  xxx   cv  1.816590  28.239721  54.662851
  xxy  min 11.00  91.00 171.00
  xxy  max 20.00 100.00 180.00
  xxy mean 15.50  95.50 175.50
  xxy   sd  3.027650   3.027650   3.027650
  xxy   cv  5.119482  31.542612  57.965742
[ snip ]

However, for speed on large datasets you can drop the names in f :

f <- function(x) c(min(x), max(x), mean(x), sd(x), mean(x)/sd(x))

and put the names in afterwards.

ans = dt[,lapply(.SD,f),by=key(dt)]
ans$labels = c("min","max","mean","sd","cv")
ans
   grp1 grp2 grp3 a  b  d labels
  xxx  1.00  81.00 161.00min
  xxx 10.00  90.00 170.00max
  xxx  5.50  85.50 165.50   mean
  xxx  3.027650   3.027650   3.027650 sd
  xxx  1.816590  28.239721  54.662851 cv
  xxy 11.00  91.00 171.00min
  xxy 20.00 100.00 180.00max
  xxy 15.50  95.50 175.50   mean
  xxy  3.027650   3.027650   3.027650 sd
  xxy  5.119482  31.542612  57.965742 cv
[ snip ]

You don't want all those small pieces of memory for the names to be created
over and over again every time f runs. That's only important for large
datasets, though.
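If you'd like to see that effect rather than take my word for it, a rough
sketch (assuming the original f was the named variant; timings illustrative)
is to time both versions in a repeat loop on the same dt :

f.named <- function(x) c(min=min(x),max=max(x),mean=mean(x),sd=sd(x),cv=mean(x)/sd(x))   # assumed form of the original f
f.nonames <- function(x) c(min(x),max(x),mean(x),sd(x),mean(x)/sd(x))
system.time(for (i in 1:100) dt[,lapply(.SD,f.named),by=key(dt)])
system.time(for (i in 1:100) dt[,lapply(.SD,f.nonames),by=key(dt)])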

Matthew





Re: [R] aggregate function - na.action

2011-02-07 Thread Matthew Dowle
Looking at the timings by each stage may help :

   system.time(dt <- data.table(dat))
   user  system elapsed
   1.20    0.28    1.48
   system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))   # sort by the 8 columns (one-off)
   user  system elapsed
   4.72    0.94    5.67
   system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
   user  system elapsed
   2.00    0.21    2.20   # compared to 11.07s


data.table doesn't have a custom data structure, so it can't be that.
data.table's structure is the same as data.frame i.e. a list of vectors.
data.table inherits from data.frame.  It *is* a data.frame, too.

The reasons it is faster in this example include :
1. Memory is only allocated for the largest group.
2. That memory is re-used for each group.
3. Since the data is ordered contiguously in RAM, the memory is copied over 
in bulk for each group using
memcpy in C, which is faster than a for loop in C. Page fetches are 
expensive; they are minimised.

This is explained in the documentation, in particular the FAQs.  This 
example is quite small, but the
concept scales to larger sizes i.e. the difference widens further as n 
increases.
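For instance, a rough sketch of that kind of comparison (sizes and exact
timings are illustrative only; both calls aggregate y within groups) :

n <- 1e6
dat <- data.frame(x1=sample(500,n,replace=TRUE), x2=sample(500,n,replace=TRUE), y=rnorm(n))
system.time(a1 <- aggregate(y ~ x1 + x2, data=dat, FUN=sum))   # lapply(split(...)) style, in R
dt <- data.table(dat)
setkey(dt,x1,x2)
system.time(a2 <- dt[, list(y=sum(y)), by='x1,x2'])            # grouped in C, memory re-used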

http://datatable.r-forge.r-project.org/

Matthew


Hadley Wickham had...@rice.edu wrote in message 
news:aanlktim6drfjxqrsqlxof1ut6xr_bshqdbgpktmed...@mail.gmail.com...
 There's definitely something amiss with aggregate() here since similar
 functions from other packages can reproduce your 'control' sum. I expect
 ddply() will have some timing issues because of all the subgrouping in 
 your
 data frame, but data.table did very well and the summaryBy() function in 
 the
 doBy package did OK:

 Well, if you use the right plyr function, it works just fine:

 system.time(count(dat, c("x1", "x2", "x3", "x4", "x4", "x5", "x6",
 "x7", "x8"), "y"))
 #   user  system elapsed
 #  9.754   1.314  11.073

 Which illustrates something that I've believed for a while about
 data.table - it's not the indexing that speed things up, it's the
 custom data structure.  If you use ddply with data frames, it's slow
 because data frames are slow.  I think the right way to resolve this
 is to to make data frames more efficient, perhaps using some kind of
 mutable interface where necessary for high-performance operations.

 Hadley

 -- 
 Assistant Professor / Dobelman Family Junior Chair
 Department of Statistics / Rice University
 http://had.co.nz/




Re: [R] using character vector as input argument to setkey (data.tablepakcage)

2011-02-07 Thread Matthew Dowle

Hi Sean,

Try :
   key(test.dt) = c("a","b")

Btw, the posting guide asks you to contact the maintainer of the package
before r-help. Otherwise r-help would fill up with posts about 2000+
packages (I guess is the reason). In this case maintainer("data.table")
returns datatable-h...@lists.r-forge.r-project.org (cc'd) where you will
be very welcome.

Matthew



Re: [R] aggregate function - na.action

2011-02-07 Thread Matthew Dowle

Hi Hadley,

Does FAQ 1.8 answer that ok ?
   Ok, I'm starting to see what data.table is about, but why didn't you 
enhance data.frame in R? Why does it have to be a new package?
   http://datatable.r-forge.r-project.org/datatable-faq.pdf

Matthew


Hadley Wickham had...@rice.edu wrote in message 
news:AANLkTik180p4YmBtR3QUCW7r=fdefxzbxsy3zwtik...@mail.gmail.com...
On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle mdo...@mdowle.plus.com 
wrote:
 Looking at the timings by each stage may help :

 system.time(dt <- data.table(dat))
 user system elapsed
 1.20 0.28 1.48
 system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8)) # sort by the
 8 columns (one-off)
 user system elapsed
 4.72 0.94 5.67
 system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2,
 x3, x4, x5, x6, x7, x8'])
 user system elapsed
 2.00 0.21 2.20 # compared to 11.07s


 data.table doesn't have a custom data structure, so it can't be that.
 data.table's structure is the same as data.frame i.e. a list of vectors.
 data.table inherits from data.frame. It *is* a data.frame, too.

 The reasons it is faster in this example include :
 1. Memory is only allocated for the largest group.
 2. That memory is re-used for each group.
 3. Since the data is ordered contiguously in RAM, the memory is copied 
 over
 in bulk for each group using
 memcpy in C, which is faster than a for loop in C. Page fetches are
 expensive; they are minimised.

But this is exactly what I mean by a custom data structure - you're
not using the usual data frame API.

Wouldn't it be better to implement these changes to data frame so that
everyone can benefit? Or is it just too specialised to this particular
case (where I guess you're using that the return data structure of the
summary function is consistent)?

Hadley


-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/



Re: [R] aggregate function - na.action

2011-02-07 Thread Matthew Dowle

Hadley,

That's fine; please do. I'm happy to explain it offline where the 
documentation or comments in the
code aren't sufficient. It's GPL code so you can take it and improve it, or 
depend on it.
Whatever works for you. As long as (of course) you don't stand on its 
shoulders and then
restrict users' freedoms (not that I'd ever think you'd do that).

One thing that did make it into R was the improvement to unique.c in R 
2.12.0.

Another that we hope happens one day is changing duplicate.c to use memcpy.
That would automatically benefit all users anywhere R copies data (including 
data.frame).
That wasn't our idea; that's been a FIXME in the R source for many years. 
See thread
on r-devel a while back (search for duplicate.c in subject). It probably 
just needs someone
to send a working patch file that passes checks. That's an example of 
something in the
data.table C code that (hopefully) will make it into base R.

Matthew


Hadley Wickham had...@rice.edu wrote in message 
news:AANLkTi=setpquiyr1+avb4-ga1-fyh9uffa6mskk+...@mail.gmail.com...
 Does FAQ 1.8 answer that ok ?
 Ok, I'm starting to see what data.table is about, but why didn't you
 enhance data.frame in R? Why does it have to be a new package?
 http://datatable.r-forge.r-project.org/datatable-faq.pdf

Kind of.  I think there are two sets of features data.table provides:

 * a compact syntax for expressing many common data manipulations
 * high performance data manipulation

FAQ 1.8 answers the question for the syntax, but not for the
performance related features.

Basically, I'd love to be able to use the high performance components
of data table in plyr, but keep using my existing syntax.  Currently
the only way to do that is for me to dig into your C code to
understand why it's fast, and then implement those ideas in plyr.

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/



Re: [R] Counting number of rows with two criteria in dataframe

2011-01-26 Thread Matthew Dowle

Note that a key is not actually required, so it's even simpler syntax :

dX = as.data.table(X)
dX[,length(unique(z)),by="x,y"]
 x y V1
[1,] 1 1  2
[2,] 1 2  2
[3,] 2 3  2
[4,] 2 4  2
[5,] 3 5  2
[6,] 3 6  2

or passing list() syntax to the 'by' is exactly the same :

dX[,length(unique(z)),by=list(x,y)]

The advantage of using the list() form is you can group by expressions
of columns, for example if x was a date column :

dX[,length(unique(z)),by=list(month(x),y)]

Matthew


Dennis Murphy djmu...@gmail.com wrote in message 
news:AANLkTi=8tysrrfzfm01m7fpzydh-cls-j-cmbkakj...@mail.gmail.com...
 Hi:

 Here are two more candidates, using the plyr and data.table packages:

 library(plyr)
 ddply(X, .(x, y), function(d) length(unique(d$z)))
  x y V1
 1 1 1  2
 2 1 2  2
 3 2 3  2
 4 2 4  2
 5 3 5  2
 6 3 6  2

 The function counts the number of unique z values in each sub-data frame
 with the same x and y values. The argument d in the anonymous function is 
 a
 data frame object.

 # data.table version:

 library(data.table)
 dX <- data.table(X, key = 'x, y')
 dX[, list(nz = length(unique(z))), by = 'x, y']
 x y nz
 [1,] 1 1  2
 [2,] 1 2  2
 [3,] 2 3  2
 [4,] 2 4  2
 [5,] 3 5  2
 [6,] 3 6  2

 The key columns sort the data by x, y combinations and then find nz in 
 each
 data subset.

 If you intend to do a lot of summarization/data manipulation in R, these
 packages are worth learning.

 HTH,
 Dennis

 On Tue, Jan 25, 2011 at 11:25 AM, Ryan Utz utz.r...@gmail.com wrote:

 Hi R-users,

 I'm trying to find an elegant way to count the number of rows in a
 dataframe
 with a unique combination of 2 values in the dataframe. My data is
 specifically one column with a year, one with a month, and one with a 
 day.
 I'm trying to count the number of days in each year/month combination. 
 But
 for simplicity's sake, the following dataset will do:

 x <- c(1,1,1,1,2,2,2,2,3,3,3,3)
 y <- c(1,1,2,2,3,3,4,4,5,5,6,6)
 z <- c(1,2,3,4,5,6,7,8,9,10,11,12)
 X <- data.frame(x, y, z)

 So with dataset X, how would I count the number of z values (3rd column 
 in
 X) with unique combinations of the first two columns (x and y)? (for
 instance, in the above example, there are 2 instances per unique
 combination
 of the first two columns). I can do this in Matlab and it's easy, but 
 since
 I'm new to R this is royally stumping me.

 Thanks,
 Ryan

 --
 Ryan Utz
 Postdoctoral research scholar
 University of California, Santa Barbara
 (724) 272 7769

[[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 [[alternative HTML version deleted]]




Re: [R] subsets

2011-01-23 Thread Matthew Dowle


require(data.table)
DT = as.data.table(df)

# 1. Patients with ah and ihd
DT[,.SD["ah"%in%diagnosis & "ihd"%in%diagnosis],by=id]

 id diagnosis
[1,]  2ah
[2,]  2   ihd
[3,]  2im
[4,]  4ah
[5,]  4   ihd
[6,]  4angina

# 2. Patients with ah but no ihd
DT[,.SD["ah"%in%diagnosis & !"ihd"%in%diagnosis],by=id]

 id diagnosis
[1,]  1ah
[2,]  3ah
[3,]  3stroke


# 3. Patients with  ihd but no ah?
DT[,.SD[!"ah"%in%diagnosis & "ihd"%in%diagnosis],by=id]

 id diagnosis
[1,]  5   ihd
 



Re: [R] Listing of available functions

2011-01-04 Thread Matthew Dowle
Try :
objects("package:base")

Also, as it happens, a new package called unknownR is in
development on R-Forge.

Its description says :
  Do you know how many functions there are in base R?
  How many of them do you know you don't know?
  Run unk() to discover your unknown unknowns.
  It's fast and it's fun !

It's not ready to try yet (and may not live up to its promises)
but hopefully should be ready soon.

Matthew


Sébastien Bihorel pomc...@free.fr wrote in message 
news:aanlktinfpmthb2osgjckeo3jwsqhw+-zdyd0xtdmk...@mail.gmail.com...
 Dear R-users,

 Is there an easy way to access a complete listing of available functions
 from an R session? The help.start() and ? functions are great, but I feel
 like they require the user to know the answer in advance (especially with
 respect to function names)... I could not find an easy way to simply browse
 through a list of functions and randomly pick one function to see what is
 does.

 Is there such a possibility in R?

 Thanks

 PS: I apologize if this question appears trivial.

 [[alternative HTML version deleted]]




Re: [R] RGL crashes

2010-12-09 Thread Matthew Dowle
Wayland is the project to remove X11 from Linux.

http://en.wikipedia.org/wiki/Wayland_(display_server)

Ubuntu chiefs have said they support Wayland and aim to include it
in the next release (April 2011 == version 11.04 == Natty Narwhal).

Fedora developers apparently said that they are likely to adopt
Wayland too.

I don't know if packages in R such as rgl would need changing to
work with Wayland, or perhaps R itself, if at all. However it seems
that Linux is moving away from X11.

Mentioned it here because the issue in this thread appears to be
X11 specific. X11's days seem to be numbered if I understand
correctly.

Matthew


Duncan Murdoch murdoch.dun...@gmail.com wrote in message 
news:4cffca13.7070...@gmail.com...
 Matthew Dowle wrote:
 Might Wayland fix it in Narwhal ?

 I hope those names mean something to Rainer, because they mean nothing to 
 me.

 Duncan Murdoch


 Duncan Murdoch murdoch.dun...@gmail.com wrote in message 
 news:4cff7177.7030...@gmail.com...
 On 08/12/2010 6:07 AM, Rainer M Krug wrote:

 On 12/08/2010 12:05 PM, Duncan Murdoch wrote:
 Rainer M Krug wrote:
 Hi

 rgl crashes my R session, when resizing the rgl graphic window.

 I am using Ubuntu Maverick, with dual monitor setup. If I disconnect
 one monitor, I can resize it a little bit, but it still crashes if I
 enlarge it too much.

 I assume that the problem has to do with allocated graphic memory in 
 the
 kernel, but why is R crashing completely, and not even giving the usual
 crash options?

 Cheers,

 Rainer


 sessionInfo()
 R version 2.12.0 (2010-10-15)
 Platform: i686-pc-linux-gnu (32-bit)

 locale:
   [1] LC_CTYPE=en_US.utf8   LC_NUMERIC=C
   [3] LC_TIME=en_US.utf8LC_COLLATE=en_US.utf8
   [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
   [7] LC_PAPER=en_US.utf8   LC_NAME=C
   [9] LC_ADDRESS=C  LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 other attached packages:
 [1] rgl_0.92.794
 version
 _
 platform   i686-pc-linux-gnu
 arch   i686
 os linux-gnu
 system i686, linux-gnu
 status
 major  2
 minor  12.0
 year   2010
 month  10
 day15
 svn rev53317
 language   R
 version.string R version 2.12.0 (2010-10-15)

 After executing

 library(rgl)
 example(rgl)

 and resizing the graph window, R crashes with the following message:

 drmRadeonCmdBuffer: -22. Kernel failed to parse or rejected command
 stream. See dmesg for more info.

 from dmesg:

 [ 7349.471959] [drm:r100_cs_track_check] *ERROR* [drm] Buffer too 
 small
 for color buffer 0 (need 413696 have 262144) !
 [ 7349.471964] [drm:r100_cs_track_check] *ERROR* [drm] color buffer 0
 (256 4 0 404)
 [ 7349.471967] [drm:radeon_cs_ioctl] *ERROR* Invalid command stream !

 Those messages look like they're coming from your graphics driver, 
 not
 from R.  So rgl may be doing something it shouldn't do, but you'll
 probably have to diagnose what that is.  It's unlikely to be
 reproducible on another system.
 That's what I fear as well - could you give me any tips on how to
 proceed to identify the problem?
 It might help to know which line of code in rgl actually triggered the 
 error, but debugging X11 code is tricky.  The function that likely 
 triggered the problem is X11WindowImpl::setWindowRect in 
 rgl/src/x11gui.cpp; it makes calls to X11 functions that do the actual 
 work.

 Duncan Murdoch

 Rainer

 Duncan Murdoch


 -- Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
 Biology, UCT), Dipl. Phys. (Germany)

 Centre of Excellence for Invasion Biology
 Natural Sciences Building
 Office Suite 2039
 Stellenbosch University
 Main Campus, Merriman Avenue
 Stellenbosch
 South Africa

 Tel:+33 - (0)9 53 10 27 44
 Cell:   +27 - (0)8 39 47 90 42
 Fax (SA):   +27 - (0)8 65 16 27 82
 Fax (D) :   +49 - (0)3 21 21 25 22 44
 Fax (FR):   +33 - (0)9 58 10 27 44
 email:  rai...@krugs.de

 Skype:  RMkrug


Re: [R] RGL crashes

2010-12-08 Thread Matthew Dowle
Might Wayland fix it in Narwhal ?

Duncan Murdoch murdoch.dun...@gmail.com wrote in message 
news:4cff7177.7030...@gmail.com...
 On 08/12/2010 6:07 AM, Rainer M Krug wrote:

 On 12/08/2010 12:05 PM, Duncan Murdoch wrote:
 Rainer M Krug wrote:
 Hi

 rgl crashes my R session, when resizing the rgl graphic window.

 I am using Ubuntu Maverick, with dual monitor setup. If I disconnect
 one monitor, I can resize it a little bit, but it still crashes if I
 enlarge it too much.

 I assume that the problem has to do with allocated graphic memory in the
 kernel, but why is R crashing completely, and not even giving the usual
 crash options?

 Cheers,

 Rainer


 sessionInfo()
 R version 2.12.0 (2010-10-15)
 Platform: i686-pc-linux-gnu (32-bit)

 locale:
   [1] LC_CTYPE=en_US.utf8   LC_NUMERIC=C
   [3] LC_TIME=en_US.utf8LC_COLLATE=en_US.utf8
   [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
   [7] LC_PAPER=en_US.utf8   LC_NAME=C
   [9] LC_ADDRESS=C  LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 other attached packages:
 [1] rgl_0.92.794
 version
 _
 platform   i686-pc-linux-gnu
 arch   i686
 os linux-gnu
 system i686, linux-gnu
 status
 major  2
 minor  12.0
 year   2010
 month  10
 day15
 svn rev53317
 language   R
 version.string R version 2.12.0 (2010-10-15)

 After executing

 library(rgl)
 example(rgl)

 and resizing the graph window, R crashes with the following message:

 drmRadeonCmdBuffer: -22. Kernel failed to parse or rejected command
 stream. See dmesg for more info.

 from dmesg:

 [ 7349.471959] [drm:r100_cs_track_check] *ERROR* [drm] Buffer too small
 for color buffer 0 (need 413696 have 262144) !
 [ 7349.471964] [drm:r100_cs_track_check] *ERROR* [drm] color buffer 0
 (256 4 0 404)
 [ 7349.471967] [drm:radeon_cs_ioctl] *ERROR* Invalid command stream !

 Those messages look like they're coming from your graphics driver, not
 from R.  So rgl may be doing something it shouldn't do, but you'll
 probably have to diagnose what that is.  It's unlikely to be
 reproducible on another system.

 That's what I fear as well - could you give me any tips on how to
 proceed to identify the problem?

 It might help to know which line of code in rgl actually triggered the 
 error, but debugging X11 code is tricky.  The function that likely 
 triggered the problem is X11WindowImpl::setWindowRect in 
 rgl/src/x11gui.cpp; it makes calls to X11 functions that do the actual 
 work.

 Duncan Murdoch


 Rainer


 Duncan Murdoch









Re: [R] fast subsetting of lists in lists

2010-12-07 Thread Matthew Dowle
Hello Alex,

Assuming it was just an inadequate example (since a data.frame would suffice 
in that case), did you know that a data.frame's columns do not have to be 
vectors but can be lists?  I don't know if that helps.

 DF = data.frame(a=1:3)
 DF$b = list(pi, 2:3, letters[1:5])
 DF
  a b
1 1  3.141593
2 2  2, 3
3 3 a, b, c, d, e
 DF$b
[[1]]
[1] 3.141593

[[2]]
[1] 2 3

[[3]]
[1] a b c d e
 sapply(DF,class)
a b
integerlist


That is still regular though in the sense that each row has a value for all 
the columns, even if that value is NA, or NULL in lists.

If your data is not regular then one option is to flatten it into a 
(row,column,value) tuple, similar to how sparse matrices are stored.  Your 
value column may be a list rather than a vector.

Then (and yes you guessed this was coming) ... you can use data.table to 
query the flat structure quickly by setting a key on the first two columns, 
or maybe just the 2nd column when you need to pick out the values for one 
'column' quickly for all 'rows'.

There was a thread about using list() columns in data.table here :

http://r.789695.n4.nabble.com/Suggest-a-cool-feature-Use-data-table-like-a-sorted-indexed-data-list-tp2544213p2544213.html

 Does someone now a trick to do the same as above with the faster built-in 
 subsetting? Something like:
 test[somesubsettingmagic]

So in data.table if you wanted all the 'b' values,  you might do something 
like this :

setkey(DT,column)
DT[J("b"), value]

which should return the list() quickly from the irregular data.
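For completeness, a minimal sketch of building that flat structure from the
small example above (column layout illustrative; assumes a recent data.table) :

require(data.table)
test <- list(list(a=1,b=2,c=3), list(a=4,b=5,c=6), list(a=7,b=8,c=9))
DT <- data.table(row    = rep(seq_along(test), sapply(test,length)),
                 column = unlist(lapply(test,names)),
                 value  = unlist(test))
setkey(DT,column)
DT[J("b"), value]   # all the 'b' values : 2 5 8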

Matthew


Alexander Senger sen...@physik.hu-berlin.de wrote in message 
news:4cfe6aee.6030...@physik.hu-berlin.de...
 Hello Gerrit, Gabor,


 thank you for your suggestion.

 Unfortunately unlist seems to be rather expensive. A short test with one
 of my datasets gives 0.01s for an extraction based on my approach and
 5.6s for unlist alone. The reason seems to be that unlist relies on
 lapply internally and does so recursively?

 Maybe there is still another way to go?

 Alex

 Am 07.12.2010 15:59, schrieb Gerrit Eichner:
 Hello, Alexander,

 does

 utest <- unlist(test)
 utest[names(utest) == "a"]

 come close to what you need?

 Hth,

 Gerrit


 On Tue, 7 Dec 2010, Alexander Senger wrote:

 Hello,


 my data is contained in nested lists (which seems not necessarily to be
 the best approach). What I need is a fast way to get subsets from the
 data.

 An example:

 test <- list(list(a = 1, b = 2, c = 3), list(a = 4, b = 5, c = 6),
 list(a = 7, b = 8, c = 9))

 Now I would like to have all values in the named variables a, that is
 the vector c(1, 4, 7). The best I could come up with is:

 val <- sapply(1:3, function (i) {test[[i]]$a})

 which is unfortunately not very fast. According to R-inferno this is due
 to the fact that apply and its derivates do looping in R rather than
 rely on C-subroutines as the common [-operator.

 Does someone now a trick to do the same as above with the faster
 built-in subsetting? Something like:

 test[somesubsettingmagic]


 Thank you for your advice


 Alex






Re: [R] Performance tuning tips when working with wide datasets

2010-11-24 Thread Matthew Dowle

Richard,

Try data.table. See the introduction vignette and the
presentations e.g. there is a slide showing a join to
183,000,000 observations of daily stock prices in
0.002 seconds.

data.table has fast rolling joins (i.e. fast last observation
carried forward) too. I see you asked about that on
this list on 8 Nov. Also see fast aggregations using 'by'
on a key()-ed in-memory table.

I wonder if your 20,000 columns are always
populated for all rows. If not then consider collapsing
to a 3 column table (row,col,data) and then
joining to that. You may have that format in your
original data source anyway, so you may be able
to skip a step you may have implemented already
which expands that format to wide. In other words,
keeping it narrow may be an option (like how a sparse
matrix is stored).
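As an untested sketch of the keyed-join route (assuming the shared column
is called 'date' in both data sets) :

require(data.table)
d1 <- data.table(data1)   # daily, 20000 rows x 25 cols
d2 <- data.table(data2)   # annual, 60 rows x 20000 cols
setkey(d1,date)
setkey(d2,date)
ans <- d2[d1,roll=TRUE]   # each daily row picks up the most recent annual row

roll=TRUE here is the last-observation-carried-forward flavour mentioned
above; drop it for an exact-match join.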

Matthew

http://datatable.r-forge.r-project.org/



Richard Vlasimsky richard.vlasim...@imidex.com wrote in message 
news:2e042129-4430-4c66-9308-a36b761eb...@imidex.com...

 Does anyone have any performance tuning tips when working with datasets 
 that are extremely wide (e.g. 20,000 columns)?

 In particular, I am trying to perform a merge like below:

 merged_data <- merge(data1, data2, 
 by.x="date", by.y="date", all=TRUE, sort=TRUE);

 This statement takes about 8 hours to execute on a pretty fast machine. 
 The dataset data1 contains daily data going back to 1950 (20,000 rows) and 
 has 25 columns.  The dataset data2 contains annual data (only 60 
 observations), however there are lots of columns (20,000 of them).

 I have to do a lot of these kinds of merges so need to figure out a way to 
 speed it up.

 I have tried  a number of different things to speed things up to no avail. 
 I've noticed that rbinds execute much faster using matrices than 
 dataframes.  However the performance improvement when using matrices (vs. 
 data frames) on merges were negligible (8 hours down to 7).  I tried 
 casting my merge field (date) into various different data types 
 (character, factor, date).  This didn't seem to have any effect. I tried 
 the hash package, however, merge couldn't coerce the class into a 
 data.frame.  I've tried various ways to parellelize computation in the 
 past, and found that to be problematic for a variety of reasons (runaway 
 forked processes, doesn't run in a GUI environment, doesn't run on Macs, 
 etc.).

 I'm starting to run out of ideas, anyone?  Merging a 60 row dataset 
 shouldn't take that long.

 Thanks,
 Richard



Re: [R] Finding the nearest data in intraday data from two zoo objects

2010-11-24 Thread Matthew Dowle
Try data.table with the roll=TRUE argument.

Set your keys and then write :

futData[optData,roll=TRUE]

That is fast and as you can see, short. Works on
many millions and even billions of rows in R.
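In full, a sketch might look like this (untested; it assumes the zoo index
is first copied into a 'time' column) :

require(data.table)
optDT <- data.table(time=index(optData.z), coredata(optData.z))
futDT <- data.table(time=index(futData.z), coredata(futData.z))
setkey(optDT,time)
setkey(futDT,time)
futDT[optDT,roll=TRUE]   # for each option trade, the last futures trade at or before it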

Matthew

http://datatable.r-forge.r-project.org/


Santosh Srinivas santosh.srini...@gmail.com wrote in message 
news:4ced3783.2af98e0a.57f0.b...@mx.google.com...
 Hello Group,

 I have the following options and future data in zoo objects

 head(optData.z)
   ExpDt OptTyp Strike TrdPrice TotTrdQty
 2009-01-01 09:55:03 20090129  1   2900 180.50
 2009-01-01 09:55:31 20090129  1   2900 188.50
 2009-01-01 09:55:37 20090129  1   2900 185.   500
 2009-01-01 09:55:39 20090129  1   2900 185.   500
 2009-01-01 09:55:47 20090129  1   2900 185.1125   600
 2009-01-01 09:55:48 20090129  1   2900 185.250050

 head(futData.z)
   ExpDt OptTyp Strike TrdPrice TotTrdQty
 2009-01-01 09:55:09 20090129  2  0 2979.000   900
 2009-01-01 09:55:11 20090129  2  0 2976.633   600
 2009-01-01 09:55:12 20090129  2  0 2977.211   900
 2009-01-01 09:55:14 20090129  2  0 2977.750   800
 2009-01-01 09:55:15 20090129  2  0 2977.019  4300
 2009-01-01 09:55:16 20090129  2  0 2977.050   800

 I want to get the closest available futures price for every option ... Is
 there any function like Excel's approximate VLOOKUP,
 using date time?

 Thank you.




Re: [R] Sorting and subsetting

2010-09-21 Thread Matthew Dowle


All the solutions in this thread so far use the lapply(split(...)) paradigm
either directly or indirectly. That paradigm doesn't scale. That's the
likely
source of quite a few 'out of memory' errors and performance issues in R.

data.table doesn't do that internally, and its syntax is pretty easy.

 tmp <- data.table(index = gl(2,20), foo = rnorm(40))

 tmp[, .SD[head(order(-foo),5)], by=index]
  index index.1   foo
 [1,] 1   1 1.9677303
 [2,] 1   1 1.2731872
 [3,] 1   1 1.1100931
 [4,] 1   1 0.8194719
 [5,] 1   1 0.6674880
 [6,] 2   2 1.2236383
 [7,] 2   2 0.9606766
 [8,] 2   2 0.8654497
 [9,] 2   2 0.5404112
[10,] 2   2 0.3373457
 

As you can see it currently repeats the group column which is a
shame (on the to do list to fix).

Matthew

http://datatable.r-forge.r-project.org/





Re: [R] Sorting and subsetting

2010-09-21 Thread Matthew Dowle

Probably true, that's cunning, but look at base::match. The
first thing it does is coerce factor to character (an allocate
and copy needed internally). data.table doesn't do that
either, see data.table:::sortedmatch.

I made the first basic steps towards a proper reproducible test
suite (timings.Rnw). Perhaps this example could be
added there; PDF is on the homepage. One test is 340
times faster and the other is 13 times faster. More
examples would be good.

Matthew
http://datatable.r-forge.r-project.org/


Joshua Wiley jwiley.ps...@gmail.com wrote in message 
news:aanlktimyuvl9suj65ktzqvpnyn+ep8ubu3mxxhhrd...@mail.gmail.com...
 On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle mdo...@mdowle.plus.com 
 wrote:


 All the solutions in this thread so far use the lapply(split(...)) 
 paradigm
 either directly or indirectly. That paradigm doesn't scale. That's the
 likely
 source of quite a few 'out of memory' errors and performance issues in R.

 This is a good point.  It is not nearly as straightforward as the
 syntax for data.table (which seems to order and select in one
 step...very nice!), but this should be less memory intensive:

 tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
 tmp <- tmp[order(tmp$index, tmp$foo) , ]

 # find location of first instance of each level and add 0:4 to it
 x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)

 tmp[x, ]


 data.table doesn't do that internally, and its syntax is pretty easy.

 tmp - data.table(index = gl(2,20), foo = rnorm(40))

 tmp[, .SD[head(order(-foo),5)], by=index]
 index index.1 foo
 [1,] 1 1 1.9677303
 [2,] 1 1 1.2731872
 [3,] 1 1 1.1100931
 [4,] 1 1 0.8194719
 [5,] 1 1 0.6674880
 [6,] 2 2 1.2236383
 [7,] 2 2 0.9606766
 [8,] 2 2 0.8654497
 [9,] 2 2 0.5404112
 [10,] 2 2 0.3373457


 As you can see it currently repeats the group column which is a
 shame (on the to do list to fix).

 Matthew

 http://datatable.r-forge.r-project.org/







 -- 
 Joshua Wiley
 Ph.D. Student, Health Psychology
 University of California, Los Angeles
 http://www.joshuawiley.com/





Re: [R] Sorting and subsetting

2010-09-21 Thread Matthew Dowle

See data.table:::duplist which does that (or at least very similar) in C,
for multiple columns too.

Matthew
http://datatable.r-forge.r-project.org/


peter dalgaard pda...@gmail.com wrote in message 
news:660991c3-b52b-4d58-b819-eadc95ecc...@gmail.com...

 On Sep 21, 2010, at 16:27 , Joshua Wiley wrote:

 On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle mdo...@mdowle.plus.com 
 wrote:


 All the solutions in this thread so far use the lapply(split(...)) 
 paradigm
 either directly or indirectly. That paradigm doesn't scale. That's the
 likely
 source of quite a few 'out of memory' errors and performance issues in 
 R.

 This is a good point.  It is not nearly as straightforward as the
 syntax for data.table (which seems to order and select in one
 step...very nice!), but this should be less memory intensive:

 tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
 tmp <- tmp[order(tmp$index, tmp$foo) , ]

 # find location of first instance of each level and add 0:4 to it
 x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)

 tmp[x, ]


 That will get you in trouble if any group has size less than 5, though.

 Something involving duplicated() could work; you just need to generate 
 the sawtooth sequence: 0,1,2,3,4,0,1,2,3,4,5,6,0,1,2,... and select values 
 less than or equal to 4. I _think_ this should work (it does on the 
 airquality dataframe, anyway):

 ix <- tmp$index

 s <- seq_along(ix)
 j <- diff(s[!duplicated(ix)])          # gaps between successive group starts
 s2 <- rep.int(0, length(s))
 s2[!duplicated(ix)] <- c(1,j)
 d <- s - cumsum(s2)                    # 0-based position within each group

 tmp[d < 5,]

 Or, another version of the same idea, giving teeth starting at 1 instead

 d <- s - c(0,cumsum(table(ix)))[factor(ix)]
 tmp[d <= 5, ]



 (There are times when I contemplate writing a DATAstep() function, this is 
 one of those things that are straightforward in the SAS sequential 
 processing paradigm. Of course there are things that are much more 
 complicated in SAS, too.)



 data.table doesn't do that internally, and it's syntax is pretty easy.

 tmp <- data.table(index = gl(2,20), foo = rnorm(40))

 tmp[, .SD[head(order(-foo),5)], by=index]
  index index.1   foo
  [1,] 1   1 1.9677303
  [2,] 1   1 1.2731872
  [3,] 1   1 1.1100931
  [4,] 1   1 0.8194719
  [5,] 1   1 0.6674880
  [6,] 2   2 1.2236383
  [7,] 2   2 0.9606766
  [8,] 2   2 0.8654497
  [9,] 2   2 0.5404112
 [10,] 2   2 0.3373457


 As you can see it currently repeats the group column which is a
 shame (on the to do list to fix).

 Matthew

 http://datatable.r-forge.r-project.org/







 -- 
 Joshua Wiley
 Ph.D. Student, Health Psychology
 University of California, Los Angeles
 http://www.joshuawiley.com/


 -- 
 Peter Dalgaard
 Center for Statistics, Copenhagen Business School
 Solbjerg Plads 3, 2000 Frederiksberg, Denmark
 Phone: (+45)38153501
 Email: pd@cbs.dk  Priv: pda...@gmail.com




Re: [R] Pass By Value Questions

2010-08-20 Thread Matthew Dowle


To: r-help
Cc: Jeff, Matt, Duncan, Hadley   [ using Nabble to cc ]

Jeff, Matt,

How about the 'refdata' class in package ref.
Also, Hadley's immutable data.frame in plyr 1.1.

Both allow you to refer to subsets of a data.frame or matrix by reference I
believe, if I understand correctly.

Matthew

http://datatable.r-forge.r-project.org/






Re: [R] coef(summary) and plyr

2010-08-09 Thread Matthew Dowle


Another option for consideration :

library(data.table)
mydt = as.data.table(mydf)

mydt[,as.list(coef(lm(y~x1+x2+x3))),by=fac]
 fac X.Intercept.       x1       x2        x3
[1,]   0  -0.16247059 1.130220 2.988769 -19.14719
[2,]   1   0.08224509 1.216673 2.847960 -19.16105
[3,]   2   0.02052320 1.135421 3.134154 -19.22555

mydt[,data.table(coef(summary(lm(y~x1+x2+x3))),keep.rownames=TRUE),  by=fac]
 fac  rn Estimate Std..Error  t.value Pr...t..
[1,]   0 (Intercept)  -0.16247059  0.1521507   -1.0678269 2.929087e-01
[2,]   0  x1   1.13021985  0.1374020    8.2256414 1.079035e-09
[3,]   0  x2   2.98876920  0.1404903   21.2738533 1.325909e-21
[4,]   0  x3 -19.14719151  0.1335139 -143.4096890 4.520371e-50
[5,]   1 (Intercept)   0.08224509  0.2360664    0.3483981 7.313719e-01
[6,]   1  x1   1.21667349  0.2723201    4.4678058 2.637743e-04
[7,]   1  x2   2.84796003  0.2232960   12.7541904 9.192555e-11
[8,]   1  x3 -19.16104669  0.2394431  -80.0233818 1.707058e-25
[9,]   2 (Intercept)   0.02052320  0.1902526    0.1078734 9.147302e-01
[10,]   2  x1   1.13542085  0.1786333    6.3561559  2.980475e-07
[11,]   2  x2   3.13415398  0.1894404   16.5442781  7.827178e-18
[12,]   2  x3 -19.22554984  0.1708307 -112.5415605  2.536686e-45

http://datatable.r-forge.r-project.org/

Matthew







Re: [R] Finding points where two timeseries cross over

2010-08-04 Thread Matthew Dowle

Is this what you mean?

x=c(1,2,2,3,4,5,6,3,2,1)
y=c(2,3,4,2,1,2,3,4,5,6)
matplot(cbind(x,y),type="l")
which(diff(sign(x-y))!=0)+1   # positions just after each sign change of x-y
[1] 4 8




Re: [R] long to wide on larger data set

2010-07-12 Thread Matthew Dowle
Juliet,

I've been corrected off list. I did not read properly that you are on 64bit.

The calculation should be :
53860858 * 4 * 8 /1024^3 = 1.6GB
since pointers are 8 bytes on 64bit.

Also, data.table is an add-on package so I should have included :

   install.packages("data.table")
   require(data.table)

data.table is available on all platforms both 32bit and 64bit.

Please forgive mistakes: 'someoone' should be 'someone', 'percieved' should 
be
'perceived' and 'testDate' should be 'testData' at the end.

The rest still applies, and you might have a much easier time than I thought
since you are on 64bit. I was working on the basis of squeezing into 32bit.

Matthew


Matthew Dowle mdo...@mdowle.plus.com wrote in message 
news:i1faj2$lv...@dough.gmane.org...

 Hi Juliet,

 Thanks for the info.

 It is very slow because of the == in  testData[testData$V2==one_ind,]

 Why? Imagine someoone looks for 10 people in the phone directory. Would
 they search the entire phone directory for the first person's phone 
 number, starting
 on page 1, looking at every single name, even continuing to the end of the 
 book
 after they had found them ?  Then would they start again from page 1 for 
 the 2nd
 person, and then the 3rd, searching the entire phone directory from start 
 to finish
 for each and every person ?  That code using == does that.  Some of us 
 call
 that a 'vector scan' and is a common reason for R being percieved as slow.

 To do that more efficiently try this :

 testData = as.data.table(testData)
 setkey(testData,V2)   # sorts data by V2
 for (one_ind in mysamples) {
   one_sample <- testData[one_ind,]
   reshape(one_sample)
 }

 or just this :

 testData = as.data.table(testData)
 setkey(testDate,V2)
 testData[,reshape(.SD,...), by=V2]

 That should solve the vector scanning problem, and get you on to the 
 memory
 problems which will need to be tackled. Since the 4 columns are character, 
 then
 the object size should be roughly :

53860858 * 4 * 4 /1024^3 = 0.8GB

 That is more promising to work with in 32bit so there is hope. [ That 
 0.8GB
 ignores the (likely small) size of the unique strings in global string 
 hash (depending
 on your data). ]

 It's likely that the as.data.table() fails with out of memory.  That is not 
 data.table
 but unique. There is a change in unique.c in R 2.12 which makes unique 
 more
 efficient and since factor calls unique, it may be necessary to use R 
 2.12.

 If that still doesn't work, then there are several more tricks (and we 
 will need
 further information), and there may be some tweaks needed to that code as 
 I
 didn't test it,  but I think it should be possible in 32bit using R 2.12.

 Is it an option to just keep it in long format and use a data.table ?

   testDate[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]

 Why do you need to reshape from long to wide?

 HTH,
 Matthew



 Juliet Hannah juliet.han...@gmail.com wrote in message 
 news:aanlktinyvgmrvdp0svc-fylgogn2ro0omnugqbxx_...@mail.gmail.com...
 Hi Jim,

 Thanks for responding. Here is the info I should have included before.
 I should be able to access 4 GB.

 str(myData)
 'data.frame':   53860857 obs. of  4 variables:
 $ V1: chr  23 26 200047 200050 ...
 $ V2: chr  cv0001 cv0001 cv0001 cv0001 ...
 $ V3: chr  A A A B ...
 $ V4: chr  B B A B ...
 sessionInfo()
 R version 2.11.0 (2010-04-22)
 x86_64-unknown-linux-gnu

 locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 On Mon, Jul 12, 2010 at 7:54 AM, jim holtman jholt...@gmail.com wrote:
 What is the configuration you are running on (OS, memory, etc.)? What
 does your object consist of? Is it numeric, factors, etc.? Provide a
 'str' of it. If it is numeric, then the size of the object is
 probably about 1.8GB. Doing the long to wide you will probably need
 at least that much additional memory to hold the copy, if not more.
 This would be impossible on a 32-bit version of R.

 On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah juliet.han...@gmail.com 
 wrote:
 I have a data set that has 4 columns and 53860858 rows. I was able to
 read this into R with:

 cc <- rep("character",4)
 myData <- 
 read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")


 I need to reshape this data from long to wide. On a small data set the
 following lines work. But on the real data set, it didn't finish even
 when I took a sample of two (rows in new data). I didn't receive an
 error. I just stopped it because it was taking too long. Any
 suggestions for improvements? Thanks.

 # start example
 # i have commented out the write.table statement below

 testData - read.table

Re: [R] long to wide on larger data set

2010-07-12 Thread Matthew Dowle

Hi Juliet,

Thanks for the info.

It is very slow because of the == in  testData[testData$V2==one_ind,]

Why? Imagine someoone looks for 10 people in the phone directory. Would
they search the entire phone directory for the first person's phone number, 
starting
on page 1, looking at every single name, even continuing to the end of the 
book
after they had found them ?  Then would they start again from page 1 for the 
2nd
person, and then the 3rd, searching the entire phone directory from start to 
finish
for each and every person ?  That code using == does that.  Some of us call
that a 'vector scan' and is a common reason for R being percieved as slow.

To do that more efficiently try this :

testData = as.data.table(testData)
setkey(testData,V2)   # sorts data by V2
for (one_ind in mysamples) {
   one_sample <- testData[one_ind,]
   reshape(one_sample)
}

or just this :

testData = as.data.table(testData)
setkey(testDate,V2)
testData[,reshape(.SD,...), by=V2]

That should solve the vector scanning problem, and get you on to the memory
problems which will need to be tackled. Since the 4 columns are character, 
then
the object size should be roughly :

53860858 * 4 * 4 /1024^3 = 0.8GB

That is more promising to work with in 32bit so there is hope. [ That 0.8GB
ignores the (likely small) size of the unique strings in global string hash 
(depending
on your data). ]

It's likely that the as.data.table() fails with out of memory.  That is not 
data.table
but unique. There is a change in unique.c in R 2.12 which makes unique more
efficient and since factor calls unique, it may be necessary to use R 2.12.

If that still doesn't work, then there are several more tricks (and we will 
need
further information), and there may be some tweaks needed to that code as I
didn't test it,  but I think it should be possible in 32bit using R 2.12.

Is it an option to just keep it in long format and use a data.table ?

   testDate[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]

Why do you need to reshape from long to wide?

HTH,
Matthew



Juliet Hannah juliet.han...@gmail.com wrote in message 
news:aanlktinyvgmrvdp0svc-fylgogn2ro0omnugqbxx_...@mail.gmail.com...
Hi Jim,

Thanks for responding. Here is the info I should have included before.
I should be able to access 4 GB.

 str(myData)
'data.frame':   53860857 obs. of  4 variables:
 $ V1: chr  23 26 200047 200050 ...
 $ V2: chr  cv0001 cv0001 cv0001 cv0001 ...
 $ V3: chr  A A A B ...
 $ V4: chr  B B A B ...
 sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

On Mon, Jul 12, 2010 at 7:54 AM, jim holtman jholt...@gmail.com wrote:
 What is the configuration you are running on (OS, memory, etc.)? What
 does your object consist of? Is it numeric, factors, etc.? Provide a
 'str' of it. If it is numeric, then the size of the object is
 probably about 1.8GB. Doing the long to wide you will probably need
 at least that much additional memory to hold the copy, if not more.
 This would be impossible on a 32-bit version of R.

 On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah juliet.han...@gmail.com 
 wrote:
 I have a data set that has 4 columns and 53860858 rows. I was able to
 read this into R with:

 cc <- rep("character",4)
 myData <- 
 read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")


 I need to reshape this data from long to wide. On a small data set the
 following lines work. But on the real data set, it didn't finish even
 when I took a sample of two (rows in new data). I didn't receive an
 error. I just stopped it because it was taking too long. Any
 suggestions for improvements? Thanks.

 # start example
 # i have commented out the write.table statement below

 testData <- read.table(textConnection("rs853,cv0084,A,A
 rs86,cv0084,C,B
 rs883,cv0084,E,F
 rs853,cv0085,G,H
 rs86,cv0085,I,J
 rs883,cv0085,K,L"),header=FALSE,sep=",")
 closeAllConnections()

 mysamples <- unique(testData$V2)

 for (one_ind in mysamples) {
 one_sample <- testData[testData$V2==one_ind,]
 mywide <- reshape(one_sample, timevar = "V1", idvar =
 "V2", direction = "wide")
 # write.table(mywide, file
 = "newdata.txt", append=TRUE, row.names=FALSE, col.names=FALSE, quote=FALSE)
 }





 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?

Re: [R] Query about using timestamps returned by SQL as 'factor' forsplit

2010-07-09 Thread Matthew Dowle
Hi Ted,

Well since you mentioned data.table (!) ...

If risk_input is a data.table consisting of 3 columns (m_id, sale_date, 
return_date) where the dates
are of class IDate (recently added to data.table by Tom) then try :

   risk_input[, fitdistr(return_date-sale_date,"normal"), by=list(m_id, 
year(sale_date), week(sale_date))]

Notice that the 'by' can contain expressions of columns, and lets you group 
by more than one expression.
You don't have to repeat the 'group by' expressions in the select, as you 
would do in SQL. data.table returns
those group columns automatically in the result, alongside the result of the 
j expression applied to each group.

If you need to aggregate by m_id, year and month rather than week another 
way is :

   risk_input[, fitdistr(return_date-sale_date,"normal"), by=list(m_id, 
round(sale_date,"month"))]

plyr and sqldf can do this task too by the way, and I'd highly recommend you 
take a look at those packages.

There are also many excellent datetime classes around which you could also 
consider.

The reason we need IDate in data.table is because data.table uses radix 
sorting, see ?sort.list. That is ultra fast for
integers. Again radix is something Tom added to data.table. The radix 
algorithm (see wikipedia) is specifically
designed to sort integers only. We would use Date, but that is stored as 
numeric. IDate is the same as Date
but stored as integer.
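A quick way to see the storage difference for yourself (a sketch; the
comments show what I'd expect) :

require(data.table)
typeof(unclass(as.Date("2010-07-09")))    # "double"
typeof(unclass(as.IDate("2010-07-09")))   # "integer"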

HTH,
Matthew


Ted Byers r.ted.by...@gmail.com wrote in message 
news:aanlktinchf3tfzkndcwolrwsxekgpfpjes3f8m5tq...@mail.gmail.com...
I have a simple query as follows:

 SELECT
 m_id,sale_date,YEAR(sale_date),WEEK(sale_date),return_type,DATEDIFF(return_date,sale_date)
 AS elapsed_time FROM risk_input

 I can get, and view, all the data that that query returns.  The question 
 is,
 sale_date is a timestamp, and I need to call split to group this data by
 m_id and the week in which the sale occurred.  Obviously, I would normally
 need both YEAR and WEEK so that data from April this year is not combined
 with that from last year (the system is non-autonomous).  And then I need 
 to
 use lapply to apply fitdistr to each subsample.

 Obviously, I can handle all this data in either a data.frame or in a
 data.table.

 There are two aspects of the question.

 1) Is there a function (or package) that will let me group (or regroup) 
 time
 series data into the week in which the data apply, properly taking into
 account the year that applies, in a single call passing sale_date as the
 argument?  If I can, then I can reduce the amount of data I draw from my
 MySQL server and the computational load it bears.

 2) The example provided for split splits only according to a single 
 variable
 (g <- airquality$Month; l <- split(airquality, g)).  How would that 
 example
 be changed if there were two or more columns in the data.frame that are
 needed to define the groups?  I.E. in my example, I'd need to group by 
 m_id,
 and the year and week values that can be computed from sale_date.

 Thanks

 Ted

 [[alternative HTML version deleted]]




Re: [R] Performance enhancement for ave

2010-06-29 Thread Matthew Dowle

 dt = data.table(d,key="grp1,grp2")
 system.time(ans1 <- dt[ , list(mean(x),mean(y)) , by=list(grp1,grp2)])
   user  system elapsed
   3.89    0.00    3.91   # your 7.064 is 12.23 for me though, so this 
3.9 should be faster for you

However, Rprof() shows that 3.9 is mostly dispatch of mean to mean.default 
which then calls .Internal.  Because there are so many groups here, dispatch 
bites.

So ...

 system.time(ans2 - dt[ , list(.Internal(mean(x)),.Internal(mean(y))), 
 by=list(grp1,grp2)])
   user  system elapsed
   0.200.000.21

 identical(ans1,ans2)
TRUE
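
If you want to keep j readable, the same dodge can be wrapped up once. A
sketch only; the function name is my own :

   fastmean = function(x) .Internal(mean(x))   # skips S3 dispatch
   ans3 = dt[ , list(fastmean(x), fastmean(y)) , by=list(grp1,grp2)]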



Hadley Wickham had...@rice.edu wrote in message 
news:aanlktilh_-3_cycf_fnqmhh6w2og5jj5u0yopx_qa...@mail.gmail.com...
 library(plyr)

 n <- 100000
 grp1 <- sample(1:750, n, replace=T)
 grp2 <- sample(1:750, n, replace=T)
 d <- data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)

 system.time({
   d$avx1 <- ave(d$x, list(d$grp1, d$grp2))
   d$avy1 <- ave(d$y, list(d$grp1, d$grp2))
 })
 #   user  system elapsed
 # 39.300   0.279  40.809
 system.time({
   d$avx2 <- ave(d$x, interaction(d$grp1, d$grp2, drop = T))
   d$avy2 <- ave(d$y, interaction(d$grp1, d$grp2, drop = T))
 })
 #  user  system elapsed
 # 6.735   0.209   7.064

 all.equal(d$avy1, d$avy2)
 # TRUE
 all.equal(d$avx1, d$avx2)
 # TRUE

 i.e. ave should use g <- interaction(..., drop = TRUE)

 Hadley

 -- 
 Assistant Professor / Dobelman Family Junior Chair
 Department of Statistics / Rice University
 http://had.co.nz/


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] lapply or data.table to find a unit's previous transaction

2010-06-03 Thread Matthew Dowle
William,

Try a rolling join in data.table, something like this (untested) :

setkey(Data, UnitID, TranDt)    # sort by unit then date
previous = transform(Data, TranDt=TranDt-1)
Data[previous, roll=TRUE]       # look up the prevailing date before, if any,
                                # for each row within that row's UnitID

That's all it is, no loops required. It should be fast and memory
efficient, hundreds of times faster than a subquery in SQL.
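
For concreteness, here is a toy version of the same idiom (data invented
here, and untested against your real set) :

   library(data.table)
   Data = data.table(TranID = paste("T", 1:5, sep=""),
                     UnitID = c(1L, 1L, 1L, 2L, 2L),
                     TranDt = as.IDate(c("2009-01-05", "2009-03-01",
                              "2009-07-20", "2009-02-11", "2009-09-30")))
   setkey(Data, UnitID, TranDt)
   previous = data.table(UnitID = Data$UnitID,
                         TranDt = as.IDate(Data$TranDt - 1))
   # roll=TRUE returns, within each UnitID, the most recent row at or
   # before TranDt-1, i.e. the previous transaction; NA where none exists
   Data[previous, roll=TRUE]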

If you have trouble please follow up on datatable-help.

Matthew


William Rogers whroger...@gmail.com wrote in message 
news:aanlktikk_avupm7j108iseryo9fucpnjhanxpaqvt...@mail.gmail.com...
I have a dataset of property transactions that includes the
transaction ID (TranID), property ID (UnitID), and transaction date
(TranDt). I need to create a data frame (or data table) that includes
the previous transaction date, if one exists.
This is an easy problem in SQL, where I just run a sub-query, but I'm
trying to make R my one-stop-shopping program. The following code
works on a subset of my data, but I can't run this on my full dataset
because my computer runs out of memory after about 30 minutes. (Using
a 32-bit machine.)
Use the following synthetic data for example.

n <- 100
TranID <- lapply(n:(2*n), function(x) (
as.matrix(paste(x, sample(seq(as.Date('2000-01-01'),
as.Date('2010-01-01'), 'days'), sample(1:5, 1)), sep= 'D'), ncol= 1)))
TranID <- do.call(rbind, TranID)
UnitID <- substr(TranID, 1, nchar(n))
TranDt <- substr(TranID, nchar(n)+2, nchar(n)+11)
Data <- data.frame(TranID= TranID, UnitID= UnitID, TranDt= as.Date(TranDt))

#First I create a list of all the previous transactions by unit

TranList <- as.matrix(Data$TranID, ncol= 1)
PreTran <- lapply(TranList,
function(x) (with(Data,
Data[
UnitID== substr(x, 1, nchar(n)) &
TranDt < Data[TranID== x, TranDt], ]
))
)

#I do get warnings about missing data because some transactions have
no predecessor.
#Some transactions have no previous transactions, others have many so
I pick the most recent

BeforeTran <- lapply(seq_along(PreTran), function(x) (
with(PreTran[[x]], PreTran[[x]][which(TranDt== max(TranDt)), ])))

#I need to add the current transaction's TranID to the list so I can merge
later

BeforeTran <- lapply(seq_along(PreTran), function(x) (
transform(BeforeTran[[x]], TranID= TranList[x, 1])))

#Finally, I convert from a list to a data frame

BeforeTran <- do.call(rbind, BeforeTran)

#I have used a combination of data.table and for loops, but that seems
cheesey and doesn't preform much better.

library(data.table)

#First I create a list of all the previous transactions by unit

TranList2 <- vector(nrow(Data), mode= 'list')
names(TranList2) <- levels(Data$TranID)
DataDT <- data.table(Data)

#Use a for loop and data.table to find the date of the previous transaction

for (i in levels(Data$TranID)) {
if (DataDT[UnitID== substr(i, 1, nchar(n)) &
TranDt <= (DataDT[TranID== i, TranDt]),
length(TranDt)] > 1)
TranList2[[i]] <- cbind(TranID= i,
DataDT[UnitID== substr(i, 1, nchar(n)) &
TranDt < (DataDT[TranID== i, TranDt]),
list(TranDt= max(TranDt))])
}

#Finally, I convert from a list to a data table

BeforeTran2 <- do.call(rbind, TranList2)

#My intution says that this code doesn't take advantage of
data.table's attributes.
#Are there any ideas out there? Thank you.
#P.S. I've tried plyr and it does not help my memory problem.

--
William H. Rogers

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] data.table 1.4.1 now on CRAN

2010-05-07 Thread Matthew Dowle
data.table is an enhanced data.frame with fast subset, fast
grouping and fast merge. It uses a short and flexible syntax
which extends existing R concepts.

Example:
DT[a>3, sum(b*c), by=d]
where DT is a data.table with 4 columns (a,b,c,d).
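
For anyone new to the syntax, a quick toy run (the data here is invented,
not part of the release itself) :

   library(data.table)
   DT = data.table(a=1:6, b=rnorm(6), c=runif(6), d=rep(c("x","y"),3))
   DT[a>3, sum(b*c), by=d]    # subset, compute and group in one step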

data.table 1.4.1 :

* grouping is now 10+ times faster than tapply()
* extract is 100+ times faster than ==, as before
* 3 new vignettes: Intro, FAQ & Timings
* NEWS file contains further details

http://datatable.r-forge.r-project.org/

http://cran.r-project.org/web/packages/data.table/index.html

There is a new mailing list, datatable-help. Please do send comments,
feedback, problems and questions.

Matthew and Tom

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using plyr::dply more (memory) efficiently?

2010-04-29 Thread Matthew Dowle
I don't know about that,  but try this :

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)
summaries[,sum(counts),by=symbol]

Please let us know if that returns the correct result,  and if its 
memory/speed is ok ?

Matthew

Steve Lianoglou mailinglist.honey...@gmail.com wrote in message 
news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0a...@mail.gmail.com...
 Hi all,

 In short:

 I'm running ddply on an admittedly (somehow) large data.frame (not
 that large). It runs fine until it finishes and gets to the
 collating part where all subsets of my data.frame have been
 summarized and they are being reassembled into the final summary
 data.frame (sorry, don't know the correct plyr terminology). During
 collation, my R workspace RAM usage goes from about 1.5 GB up to 20GB
 until I kill it.

 Running a similar piece of code that iterates manually w/o ddply by
 using a combo of lapply and a do.call(rbind, ...) uses considerably
 less ram (tops out at about 8GB).

 How can I use ddply more efficiently?

 Longer:

 Here's more info:

 * The data.frame itself ~ 15.8 MB when loaded.
 * ~ 400,000 rows, 8 columns

 It looks like so:

     exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
 1         4225        468                 0       utr      0 WASH5P     WASH5P chr1
 2         4833         69                 0       utr      1 WASH5P     WASH5P chr1
 3         5659        152                38       utr      1 WASH5P     WASH5P chr1
 4         6470        159                 0       utr      0 WASH5P     WASH5P chr1
 5         6721        198                 0       utr      0 WASH5P     WASH5P chr1
 6         7096        136                 0       utr      0 WASH5P     WASH5P chr1
 7         7469        137                 0       utr      0 WASH5P     WASH5P chr1
 8         7778        147                 0       utr      0 WASH5P     WASH5P chr1
 9         8131         99                 0       utr      0 WASH5P     WASH5P chr1
 10       14601        154                 0       utr      0 WASH5P     WASH5P chr1
 11       19184         50                 0       utr      0 WASH5P     WASH5P chr1
 12        4693        140                36    intron      2 WASH5P     WASH5P chr1
 13        4902        757                36    intron      1 WASH5P     WASH5P chr1
 14        5811        659               144    intron     47 WASH5P     WASH5P chr1
 15        6629         92                21    intron      1 WASH5P     WASH5P chr1
 16        6919        177                 0    intron      0 WASH5P     WASH5P chr1
 17        7232        237                35    intron      2 WASH5P     WASH5P chr1
 18        7606        172                 0    intron      0 WASH5P     WASH5P chr1
 19        7925        206                 0    intron      0 WASH5P     WASH5P chr1
 20        8230       6371               109    intron     67 WASH5P     WASH5P chr1
 21       14755       4429                55    intron     12 WASH5P     WASH5P chr1
 ...

 I'm ply-ing over the transcript column and the function transforms
 each such subset of the data.frame into a new data.frame that is just
 1 row / transcript that basically has the sum of the counts for each
 transcript.

 The code would look something like this (`summaries` is the data.frame
 I'm referring to):

 rpkm <- ddply(summaries, .(transcript), function(df) {
   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
 })

 (It actually calculates 2 more columns that are returned in the
 data.frame, but I'm not sure that's really important here).

 To test some things out, I've written another function to manually
 iterate/create subsets of my data.frame to summarize.

 I'm using sqldf to dump the data.frame into a db, then I lapply over
 subsets of the db `where transcript=x` to summarize each subset of my
 data into a list of single-row data.frames (like ddply is doing), and
 finish with a `do.call(rbind, the.dfs)` on this list.

 This returns the same exact result ddply would return, and by the time
 `do.call` finishes, my RAM usage hits about 8gb.

 So, what am I doing wrong with ddply that makes the difference in RAM
 usage in the last step (collation -- the equivalent of my final
 `do.call(rbind, my.dfs)`) be more than 12GB?

 Thanks,
 -steve

 -- 
 Steve Lianoglou
 Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
 Contact Info: http://cbio.mskcc.org/~lianos/contact


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using plyr::dply more (memory) efficiently?

2010-04-29 Thread Matthew Dowle

Steve Lianoglou mailinglist.honey...@gmail.com wrote in message 
news:t2ybbdc7ed01004290812n433515b5vb15b49c170f5a...@mail.gmail.com...

 Thanks for directing me to the data.table package. I read through some
 of the vignettes, and it looks quite nice.

 While your sample code would provide the answer if I wanted to just
 compute some summary statistic/function of groups of my data.frame
 (using `by=symbol`), what's the best way to produce several pieces of
 info per subset?

 For instance, I see that I can do something like this:

 summaries[, list(counts=sum(counts), width=sum(exon.width)), by=symbol]

Yes, thats it.

 But what if I need to do some more complex processing within the
 subsets defined in `by=symbol` -- like several lines of programming
 logic for 1 result, say.

 I guess I can open a new block that just returns a data.table? Like:

 summaries[, {
   cnts <- sum(counts)
   ew <- sum(exon.width)
   # ... some complex things
   complex <- # .. result of complex things
   data.table(counts=cnts, width=ew, cplx=complex)
 }, by=symbol]

 Is that right? (I mean, it looks like it's working, but maybe there's
 a more idiomatic way(?))

Yes, you got it.  Rather than a data.table at the end though, just return a
list; it's faster.  Shorter vectors will still be recycled to match any
longer ones.
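
For example (a tiny sketch, data invented) :

   DT = data.table(grp=c("a","a","b"), v=1:3)
   DT[, list(v=v, total=sum(v)), by=grp]   # length-1 sum(v) recycles alongside v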

Or just this :

summaries[, list(
    counts = sum(counts),
    width = sum(exon.width),
    cplx = # .. result of complex things
), by=symbol]


Sounds like it's working, but could you give us an idea whether it is quick
and memory efficient ?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] sum specific rows in a data frame

2010-04-20 Thread Matthew Dowle

Or try data.table 1.4 on r-forge, its grouping is faster than aggregate :

          agg datatable
X10     0.012     0.008
X100    0.020     0.008
X1000   0.172     0.020
X10000  1.164     0.144
X1e+05  9.397     1.180

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
dt = as.data.table(df)
t3 <- system.time(zz3 <- dt[, list(sumflt=sum(fltval), sumint=sum(intval)),
by=id])

Matthew


On Thu, 15 Apr 2010 13:09:17 +0000, hadley wickham wrote:
 On Thu, Apr 15, 2010 at 1:16 AM, Chuck vijay.n...@gmail.com wrote:
 Depending on the size of the dataframe and the operations you are
 trying to perform, aggregate or ddply may be better.  In the function
 below, df has the same structure as your dataframe.
 
 Current version of plyr:
 
            agg  ddply
  X10     0.005  0.007
  X100    0.007  0.026
  X1000   0.086  0.248
  X10000  0.577  3.136
  X1e+05  4.493 44.147
 
 Development version of plyr:
 
            agg ddply
  X10     0.003 0.005
  X100    0.007 0.007
  X1000   0.042 0.044
  X10000  0.410 0.443
  X1e+05  4.479 4.237
 
 So there are some big speed improvements in the works.
 
 Hadley

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] match function or ==

2010-04-08 Thread Matthew Dowle
Please install v1.3 from R-forge :
install.packages("data.table", repos="http://R-Forge.R-project.org")

It will be ready for CRAN soon.

Please follow up on datatable-h...@lists.r-forge.r-project.org

Matthew

bo bozha...@hotmail.com wrote in message 
news:1270689586866-1755876.p...@n4.nabble.com...

 Thank you very much for the help.

 I installed the data.table package, but I keep getting the following warnings:

 setkey(DT,id,date)
 Warning messages:
 1: In `[.data.table`(deref(x), o) :
 This R session is < 2.4.0. Please upgrade to 2.4.0+.

 I'm using R 2.10, so why do I keep getting warnings to upgrade?  Thanks
 again.


 -- 
 View this message in context: 
 http://n4.nabble.com/match-function-or-tp1754505p1755876.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Code is too slow: mean-centering variables in a dataframebysubgroup

2010-04-08 Thread Matthew Dowle
Hi Dimitri,

A start has been made at explaining .SD in FAQ 2.1. This was previously on a
webpage, but it's just been moved to a vignette :

https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/branch2/inst/doc/faq.pdf?rev=68root=datatable

Please note: that vignette is part of a development branch on r-forge, and 
as such isn't even released to the r-forge repository yet.

Please also see FAQ 4.5 in that vignette and follow up on 
datatable-h...@lists.r-forge.r-project.org
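
In one line meanwhile, and only as a sketch (toy data, not Dimitri's
frame) : .SD is the Subset of Data for each group, so j can apply a
function across the group's columns :

   DT = data.table(grp=c("a","a","b"), v1=1:3, v2=4:6)
   DT[, lapply(.SD, sum), by=grp]    # sums v1 and v2 within each grp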

An introduction vignette is taking shape too (again, in the development 
branch i.e. bleeding edge) :

https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/branch2/inst/doc/intro.pdf?rev=68root=datatable

HTH
Matthew


Dimitri Liakhovitski ld7...@gmail.com wrote in message 
news:r2rdae9a2a61004071314xc03ae851n4c9027b28df5a...@mail.gmail.com...
Yes, Tom's solution is indeed the fastest!
On my PC it took .17-.22 seconds while using ave() took .23-.27 seconds.
And of course - the last two methods I mentioned took 1.3 SECONDS, not
MINUTES (it was a typo).

All that is left to me is to understand what .SD stands for.
:-)

Dimitri

On Wed, Apr 7, 2010 at 4:04 PM, Rob Forler rfor...@uchicago.edu wrote:
 Leave it up to Tom to solve things wickedly fast :)

 Just as an fyi Dimitri, Tom is one of the developers of data.table.

 -Rob

 On Wed, Apr 7, 2010 at 2:51 PM, Dimitri Liakhovitski ld7...@gmail.com
 wrote:

 Wow, thank you, Tom!

 On Wed, Apr 7, 2010 at 3:46 PM, Tom Short tshort.rli...@gmail.com 
 wrote:
  Here's how I would have done the data.table method. It's a bit faster
  than the ave approach on my machine:
 
  # install.packages("data.table", repos="http://R-Forge.R-project.org")
  library(data.table)

  f3 <- function(frame) {
  +   frame <- as.data.table(frame)
  +   frame[, lapply(.SD[,2:ncol(.SD), with = FALSE],
  +     function(x) x / mean(x, na.rm = TRUE)),
  +     by = group]
  + }
 
  system.time(new.frame2 <- f2(frame)) # ave
     user  system elapsed
     0.50    0.08    1.24
  system.time(new.frame3 <- f3(frame)) # data.table
     user  system elapsed
     0.25    0.01    0.30
 
  - Tom
 
  Tom Short
 
 
  On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski 
  ld7...@gmail.com
  wrote:
  I would like to thank once more everyone who helped me with this
  question.
  I compared the speed for different approaches. Below are the results
  of my comparisons - in case anyone is interested:
 
  ### Building an EXAMPLE FRAME with N rows - with groups and a lot of
  NAs:
  N <- 100000
  set.seed(1234)
  frame <- data.frame(group=rep(paste("group",1:10),N/10), a=rnorm(1:N),
    b=rnorm(1:N), c=rnorm(1:N), d=rnorm(1:N), e=rnorm(1:N), f=rnorm(1:N),
    g=rnorm(1:N))
  frame <- frame[order(frame$group),]
 
  ## Introducing 60% NAs:
  names.used <- names(frame)[2:length(frame)]
  set.seed(1234)
  for(i in names.used){
    i.for.NA <- sample(1:N, round((N*.6),0))
    frame[[i]][i.for.NA] <- NA
  }
  lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
  ORIGframe <- frame ## placeholder for the unchanged original frame
 
  ### Objective of the code - divide each value by its group mean
  
 
  ### METHOD 1 - the FASTEST - using ave(): ##
  frame <- ORIGframe
  f2 <- function(frame) {
    for(i in 2:ncol(frame)) {
      frame[,i] <- ave(frame[,i], frame[,1],
        FUN=function(x) x/mean(x,na.rm=TRUE))
    }
    frame
  }
  system.time({new.frame <- f2(frame)})
  # Took me 0.23-0.27 sec
  ###
 
  ### METHOD 2 - fast, just a bit slower - using data.table:
  ##
 
  # If you don't have it - install the package - NOT from CRAN:
  install.packages("data.table", repos="http://R-Forge.R-project.org")
  library(data.table)
  frame <- ORIGframe
  system.time({
    table <- data.table(frame)
    colMeanFunction <- function(data,key){
      data[[key]] = NULL
      ret = as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
      return(ret)
    }
    groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"]
    names.to.use <- names(groupedMeans)
    for(i in 1:length(groupedMeans)){
      groupedMeans[[i]] <- as.data.frame(groupedMeans[[i]])
    }
    groupedMeans <- do.call(cbind, groupedMeans)
    names(groupedMeans) <- names.to.use
  })
  # Took me 0.37-.45 sec
  ###
 
  ### METHOD 3 - fast, a tad slower (using model.matrix  matrix
  multiplication):##
  frame <- ORIGframe
  system.time({
    mat <- as.matrix(frame[,-1])
    mm <- model.matrix(~0+group,frame)
    col.grp.N <- crossprod( !is.na(mat), mm )   # Use this line if don't
  want to use NAs for mean calculations
    # col.grp.N <- crossprod( mat != 0 , mm )   # Use this line if don't
  want to use zeros for mean calculations
    mat[is.na(mat)] <- 0.0
    col.grp.sum <- crossprod( mat, mm )
    mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
    is.na(mat) <- is.na(frame[,-1])
    mat <- as.data.frame(mat)
  })
  # Took me 0.44-0.50 sec
  ###
 
  ### 

Re: [R] memory error

2010-04-06 Thread Matthew Dowle
 someone else on this list may be able to give you a ballpark estimate
 of how much RAM this merge would require.

I don't have an absolute estimate, but try data.table::merge, as it needs 
less
working memory than base::merge.

20 million rows of 5 columns isn't beyond 32bit :
   (1*4 + 4*8)*19758564/1024^3 = 0.662GB
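
One way to sanity-check that arithmetic is to build a small prototype and
scale up (a sketch; the names are invented) :

   n <- 1e6
   proto <- data.frame(a=1:n, b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
   object.size(proto)   # ~36 MB, so roughly 20x that for 19.8 million rows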

Also try sqldf to do the join.

Matthew


Sharpie ch...@sharpsteen.net wrote in message
news:1270102758449-1747733.p...@n4.nabble.com...


 Janet Choate-2 wrote:

 Thanx for clarification on stating my problem, Charlie.

 I am attempting to merge to files, i.e.:
 hi39 = merge(comb[,c("hillID","geo")], hi.h39, by=c("hillID"))

 if this is relevant or helps to explain:
 the file 'comb' is 3 columns and 1127 rows
 the file 'hi.h39' is 5 columns and 19758564 rows

 i started a new clean R session in which i was able to read those 2 files
 in, but get the following error when i try to merge them:

 R(2175) malloc: *** mmap(size=79036416) failed (error code=12)
 *** error: can't allocate region
 *** set a breakpoint in malloc_error_break to debug
 R(2175) malloc: *** mmap(size=79036416) failed (error code=12)
 *** error: can't allocate region
 *** set a breakpoint in malloc_error_break to debug
 R(2175) malloc: *** mmap(size=158068736) failed (error code=12)
 *** error: can't allocate region
 *** set a breakpoint in malloc_error_break to debug
 R(2175) malloc: *** mmap(size=158068736) failed (error code=12)
 *** error: can't allocate region
 *** set a breakpoint in malloc_error_break to debug
 R(2175) malloc: *** mmap(size=158068736) failed (error code=12)
 *** error: can't allocate region
 *** set a breakpoint in malloc_error_break to debug
 Error: cannot allocate vector of size 150.7 Mb

 so the final error is Cannot allocate vector of size 150.7 Mb, as
 suggested when R runs out of memory.

 i am running R version 2.9.2, on mac os X 10.5 - leopard.

 any suggestion on how to increase R's memory on a mac?
 thanx for any much needed help!
 Janet


 Ah, so it is indeed a shortage of memory problem.  With R 2.9.2, you are
 likely running a 32 bit version of R which will be limited to accessing at
 most 4 GB of RAM. You may want to try the newest version of R, 2.10.1, as
 it includes a 64 bit version that will allow you to access significantly
 more memory- provided you have the RAM installed on your system.

 I'm not too hot on memory usage calculation, but someone else on this list
 may be able to give you a ballpark estimate of how much RAM this merge
 would
 require.  If it turns out to be a ridiculous amount, you will need to
 consider breaking the merge up into chunks or finding an out-of-core (i.e.
 not dependent on RAM for storage) merge tool.

 Hope this helps!

 -Charlie

 -
 Charlie Sharpsteen
 Undergraduate-- Environmental Resources Engineering
 Humboldt State University
 -- 
 View this message in context:
 http://n4.nabble.com/memory-error-tp1747357p1747733.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Adding RcppFrame to RcppResultSet causes segmentation fault

2010-04-01 Thread Matthew Dowle
Rob,
Please look again at Romain's reply to you on 19th March.  He informed you 
then that Rcpp has its own dedicated mailing list and he gave you the link.
Matthew

R_help Help rhelp...@gmail.com wrote in message 
news:ad1ead5f1003291753p68d6ed52q572940f13e1c0...@mail.gmail.com...
 Hi,

 I'm a bit puzzled. I use exactly the same code as in the RcppExamples
 package to try adding an RcppFrame object to an RcppResultSet. When run,
 it gives me a segmentation fault. I'm using gcc 4.1.2 on redhat
 64bit. I'm not sure if this is the cause of the problem. Any advice
 would be greatly appreciated. Thank you.

 Rob.


 int numCol=4;
 std::vector<std::string> colNames(numCol);
 colNames[0] = "alpha"; // column of strings
 colNames[1] = "beta";  // column of reals
 colNames[2] = "gamma"; // factor column
 colNames[3] = "delta"; // column of Dates
 RcppFrame frame(colNames);

 // Third column will be a factor. In the current implementation the
 // level names are copied to every factor value (and factors
 // in the same column must have the same level names). The level names
 // for a particular column will be factored out (pardon the pun) in
 // a future release.
 int numLevels = 2;
 std::string *levelNames = new std::string[2];
 levelNames[0] = std::string("pass"); // level 1
 levelNames[1] = std::string("fail"); // level 2

 // First row (this one determines column types).
 std::vector<ColDatum> row1(numCol);
 row1[0].setStringValue("a");
 row1[1].setDoubleValue(3.14);
 row1[2].setFactorValue(levelNames, numLevels, 1);
 row1[3].setDateValue(RcppDate(7,4,2006));
 frame.addRow(row1);

 // Second row.
 std::vector<ColDatum> row2(numCol);
 row2[0].setStringValue("b");
 row2[1].setDoubleValue(6.28);
 row2[2].setFactorValue(levelNames, numLevels, 1);
 row2[3].setDateValue(RcppDate(12,25,2006));
 frame.addRow(row2);

 RcppResultSet rs;
 rs.add("PreDF", frame);


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Adding RcppFrame to RcppResultSet causes segmentation fault

2010-04-01 Thread Matthew Dowle
He could have posted into this thread at the time to say that.  Otherwise
it appears as if it's still open.

Romain Francois romain.franc...@dbmail.com wrote in message 
news:4bb4c4b8.2030...@dbmail.com...
The thread has been handled in Rcpp-devel. Rob posted there 7 minutes
after posting on r-help.

FWIW, I think the problem is fixed on the Rcpp 0.7.11 version (on cran
incoming)

Romain

On 01/04/10 17:47, Matthew Dowle wrote :

 Rob,
 Please look again at Romain's reply to you on 19th March.  He informed you
 then that Rcpp has its own dedicated mailing list and he gave you the 
 link.
 Matthew

 R_help Helprhelp...@gmail.com  wrote in message
 news:ad1ead5f1003291753p68d6ed52q572940f13e1c0...@mail.gmail.com...
 Hi,

 I'm a bit puzzled. I use exactly the same code as in the RcppExamples
 package to try adding an RcppFrame object to an RcppResultSet. When run,
 it gives me a segmentation fault. I'm using gcc 4.1.2 on redhat
 64bit. I'm not sure if this is the cause of the problem. Any advice
 would be greatly appreciated. Thank you.

 Rob.


 int numCol=4;
 std::vector<std::string> colNames(numCol);
 colNames[0] = "alpha"; // column of strings
 colNames[1] = "beta";  // column of reals
 colNames[2] = "gamma"; // factor column
 colNames[3] = "delta"; // column of Dates
 RcppFrame frame(colNames);

 // Third column will be a factor. In the current implementation the
 // level names are copied to every factor value (and factors
 // in the same column must have the same level names). The level names
 // for a particular column will be factored out (pardon the pun) in
 // a future release.
 int numLevels = 2;
 std::string *levelNames = new std::string[2];
 levelNames[0] = std::string("pass"); // level 1
 levelNames[1] = std::string("fail"); // level 2

 // First row (this one determines column types).
 std::vector<ColDatum> row1(numCol);
 row1[0].setStringValue("a");
 row1[1].setDoubleValue(3.14);
 row1[2].setFactorValue(levelNames, numLevels, 1);
 row1[3].setDateValue(RcppDate(7,4,2006));
 frame.addRow(row1);

 // Second row.
 std::vector<ColDatum> row2(numCol);
 row2[0].setStringValue("b");
 row2[1].setDoubleValue(6.28);
 row2[2].setFactorValue(levelNames, numLevels, 1);
 row2[3].setDateValue(RcppDate(12,25,2006));
 frame.addRow(row2);

 RcppResultSet rs;
 rs.add("PreDF", frame);


-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://tr.im/OIXN : raster images and RImageJ
|- http://tr.im/OcQe : Rcpp 0.7.7
`- http://tr.im/O1wO : highlight 0.1-5

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] nlrq parameter bounds

2010-04-01 Thread Matthew Dowle
Ashley,

This appears to be your first post to this list. Welcome to R. Over 2 days
is quite a long time to wait though, so you are unlikely to get a reply now.

Feedback : since nlrq is in package quantreg, it's a question about a
package and should be sent to the package maintainer. Some packages though,
over 40 of the 664 on r-forge, have dedicated help/devel/forum lists hosted
on r-forge.

No reply from r-help often, but not always, means you haven't followed some
detail of the posting guide or haven't followed this :
http://www.catb.org/~esr/faqs/smart-questions.html.

HTH
Matthew


Ashley Greenwood a.greenwo...@pgrad.unimelb.edu.au wrote in message 
news:45708.131.217.6.9.1269916052.squir...@webmail.student.unimelb.edu.au...
 Hi there,
 Can anyone please tell me if it is possible to limit parameters in nlrq()
 to 'upper' and 'lower' bounds as per nls()?  If so how??

 Many thanks in advance


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error grid must have equal distances in each direction

2010-03-31 Thread Matthew Dowle
M Joshi,

I don't know but I guess that some might have looked at your previous thread 
on 14 March (also about the geoR package). You received help and good advice 
then, but it doesn't appear that you are following it.  It appears to be a 
similar problem this time.

Also, this list is the wrong place for that question. Please read the 
posting guide to find out the correct place. It's a question about a package.

HTH,
Matthew


maddy madhura1...@gmail.com wrote in message 
news:1269974076132-1745651.p...@n4.nabble.com...

 Hello All,

 Can anyone please help me on this error?

 Error in FUN(X[[1L]], ...) :
  different grid distances detected, but the grid must have equal distances
 in each direction -- try gridtriple=TRUE that avoids numerical errors.

 The program that I am trying to run is posted in the previous post of this
 thread.  After row 1021 of my matrix of size 1024*1024, I start getting
 all the values as 0s.
 How do I set gridtriple, given that I am using the grf function which does
 not take this parameter as input?

 The maximum vector length that can be reached in 'R' is 2^30, so why does
 it not allow me to create arrays even of size 2^17?

 Thanks,
 M Joshi
 -- 
 View this message in context: 
 http://n4.nabble.com/Error-grid-must-have-equal-distances-in-each-direction-tp1695189p1745651.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Question about 'logit' and 'mlogit' in Zelig

2010-03-31 Thread Matthew Dowle
Abraham,

This appears to be your 3rd unanswered post to r-help in March, all 3 have 
been about the Zelig package.

Please read the posting guide and find out the correct place to send 
questions about packages.  Then you might get an answer.

HTH
Matthew


Mathew, Abraham T amat...@ku.edu wrote in message 
news:281f7a5fdfef844696011cb21185f8ac0be...@mailbox-11.home.ku.edu...


I'm running a multinomial logit in R using the Zelig package. According to
str(trade962a), my dependent variable is a factor with three levels. When I
run the multinomial logit I get an error message. However, when I run
'model=logit' it works fine. Any ideas on what's wrong?

## MULTINOMIAL LOGIT
anes96two <- zelig(trade962a ~ age962 + education962 + personal962 +
economy962 + partisan962 + employment962 + union962 + home962 + market962 + 
race962 + income962, model="mlogit", data=data96)
summary(anes96two)

#Error in attr(tt, depFactors)$depFactorVar :
#  $ operator is invalid for atomic vectors


## LOGIT
Call:
zelig(formula = trade962a ~ age962 + education962 + personal962 +
economy962 + partisan962 + employment962 + union962 + home962 +
market962 + race962 + income962, model = "logit", data = data96)

Deviance Residuals:
   Min  1Q  Median  3Q Max
-2.021  -1.179   0.764   1.032   1.648

Coefficients:
    Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.697675   0.600991  -1.161   0.2457
age962 0.003235   0.004126   0.784   0.4330
education962  -0.065198   0.038002  -1.716   0.0862 .
personal9620.006827   0.072421   0.094   0.9249
economy962-0.200535   0.084554  -2.372   0.0177 *
partisan9620.092361   0.079005   1.169   0.2424
employment962 -0.009346   0.044106  -0.212   0.8322
union962  -0.016293   0.149887  -0.109   0.9134
home962   -0.150221   0.133685  -1.124   0.2611
market962  0.292320   0.128636   2.272   0.0231 *
race9620.205828   0.094890   2.169   0.0301 *
income962  0.263363   0.048275   5.455 4.89e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1841.2  on 1348  degrees of freedom
Residual deviance: 1746.3  on 1337  degrees of freedom
  (365 observations deleted due to missingness)
AIC: 1770.3

Number of Fisher Scoring iterations: 4




Thanks
Abraham

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] zero standard errors with geeglm in geepack

2010-03-31 Thread Matthew Dowle
You may not have got an answer because you posted to the wrong place. Its a 
question about a package. Please read the posting guide.

miriza miri...@sfwmd.gov wrote in message 
news:1269886286228-1695430.p...@n4.nabble.com...

 Hi!

 I am using geeglm to fit a Poisson model to a timeseries of count data as
 follows.  Since there are no clusters I use 73 values of 1 for the ids. 
 The
 problem I have is that I am getting standard errors of zero for the
 parameters.  What am I doing wrong?
 Thanks, Michelle
  N_Base
 [1]  95  85 104  88 102 104  91  88  85 115  96  83  91 107  96 116 118 
 103
 89  88 101 117  82  80  83 103 115 119  95  90  82  91 108 115  93  96  72
 [38]  98  95  98  97 104  86 107  92  94  95 100 107  76 104 101  80 102 
 100
 91  96  89  71 109  97 113  99 127 115  91  81  73  69  92  90  78  57
 Year
 [1] 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945
 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
 1961
 [31] 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975
 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
 1991
 [61] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2006

 tes=geese(formula = N_Base ~ Year, id = rep(1, 73), family = poisson,
 corstr = "ar1")
 summary(tes)

 Call:
 geese(formula = N_Base ~ Year, id = rep(1, 73), family = poisson,
corstr = "ar1")

 Mean Model:
 Mean Link: log
 Variance to Mean Relation: poisson

 Coefficients:
estimate san.se wald p
 (Intercept)   7.1131  0  Inf 0
 Year -0.0013  0  Inf 0

 Scale Model:
 Scale Link:identity

 Estimated Scale Parameters:
estimate san.se wald p
 (Intercept) 1.79  0  Inf 0

 Correlation Model:
 Correlation Structure: ar1
 Correlation Link:  identity

 Estimated Correlation Parameters:
  estimate san.se wald p
 alpha      0.187      0  Inf 0

 Returned Error Value:  0
 Number of clusters:   1   Maximum cluster size: 73

 -- 
 View this message in context: 
 http://n4.nabble.com/zero-standard-errors-with-geeglm-in-geepack-tp1695430p1695430.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] GEE for a timeseries of count (one cluster)

2010-03-31 Thread Matthew Dowle
Contact the authors of those packages ?

miriza miri...@sfwmd.gov wrote in message 
news:1269981675252-1745896.p...@n4.nabble.com...

 Hi!

 I was wondering if there were any packages that would allow me to fit a 
 GEE
 to a single timeseries of counts so that I could account for 
 autocorrelation
 in the data.  I tried gee, geepack and yags packages, but I do not get
 standard errors for the parameters when using a single cluster.  Any tips?

 Thanks, Michelle
 -- 
 View this message in context: 
 http://n4.nabble.com/GEE-for-a-timeseries-of-count-one-cluster-tp1745896p1745896.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] mcmcglmm starting value example

2010-03-31 Thread Matthew Dowle
Apparently not, since this your 3rd unanswered thread to r-help this month 
about this package.

Please read the posting guide and find out where you should send questions 
about packages.  Then you might get an answer.


ping chen chen1984...@yahoo.com.cn wrote in message 
news:975148.47160...@web15304.mail.cnb.yahoo.com...
 Hi R-users:

 Can anyone give an example of giving starting values for MCMCglmm?
 I can't find any anywhere.
 I have 1 random effect (physicians, and there are 50 of them)
 and family=ordinal.

 How can I specify starting values for my fixed effects? It doesn't seem to 
 have the option to do so.

 Thanks, Ping




 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] GLM / large dataset question

2010-03-31 Thread Matthew Dowle
Geelman,

This appears to be your first post to this list. Welcome to R. Nearly 2 days 
is quite a long time to wait though, so you are unlikely to get a reply now.

Feedback : the question seems quite vague and imprecise. It depends on which 
R you mean (32bit/64bit) and how much ram you have.  It also depends on your 
data and what you want to do with it. Did you mean 100.000 (i.e. one
hundred) or 100,000?  Also, '8000 explanatory variables' seems a lot,
especially to be stored in 'a factor'.  There is no R code in your post so 
we can't tell if you're using glm correctly or not.  You could provide the 
result of object.size(), and dim() on your data rather than explaining it in 
words.

No reply often, but not always, means you haven't followed some detail of 
the posting guide or haven't followed this : 
http://www.catb.org/~esr/faqs/smart-questions.html.

HTH
Matthew

geelman geel...@zonnet.nl wrote in message 
news:mkedkcmimcmgohidffmbieklcaaa.geel...@zonnet.nl...
 LS,

 How large a dataset can glm fit with a binomial link function?  I have a 
 set
 of about 100.000 observations and about 8000 explanatory variables (a 
 factor
 with 8000 levels).

 Is there a way to find out how large datasets R can handle in general?



 Thanks in advance,


 geelman


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Combing

2010-03-29 Thread Matthew Dowle
Val,

Type "combine two data sets" (text you wrote in your post) into
www.rseek.org. The first two links are: "Quick-R: Merge" and "Merging data:
A tutorial".  Isn't it quicker to use rseek than to write a post and wait
for a reply ?  Don't you also get more detailed information that way too ?

You already received advice from others on this list to look at 
www.rseek.org on 26 Oct,  package 'sos' on 27 Oct, and to 'read the manuals 
and FAQs before posting' on 5 Nov.

This month you have posted 3 times : Loop, Renumbering and Combing.

References :
1. Posting Guide headings : "Do your homework before posting" and "Further
resources"
2. Contributed Documentation e.g. 'R Reference Card' by Tom Short 
http://cran.r-project.org/doc/contrib/Short-refcard.pdf.
3. Eric Raymond's essay http://www.catb.org/~esr/faqs/smart-questions.html. 
e.g. you posted to r-help 10 times so far,  9 of the 10 subjects were either 
a single word, or a single function name.

HTH
Matthew


Val valkr...@gmail.com wrote in message 
news:cdc083ac1003290413s7e047e25lc4202568af119...@mail.gmail.com...
 Hi all,

 I want to combine two data sets (ZA and ZB to get ZAB).
 The common variable between the two data sets is ID.

 Data  ZA
 ID F  M
 1  0  0
 2  0  0
 3  1  2
 4  1  0
 5  3  2
 6  5  4

 Data ZB

 ID  v1  v2 v3
 3  2.5 3.4 302
 4  8.6 2.9 317
 5  9.7 4.0 325
 6  7.5 1.9 296

 Output (ZAB)

 ID F  M  v1  v2  v3
 1  0  0   -9  -9  -9
 2  0  0   -9  -9  -9
 3  1  2  2.5 3.4 302
 4  1  0  8.6 2.9 317
 5  3  2  9.7 4.0 325
 6  5  4  7.5 1.9 296

 Any help is highly appreciated in advance,

 Val

 [[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] NA values in indexing

2010-03-26 Thread Matthew Dowle
The type of 'NA' is logical. So x[NA] behaves more like x[TRUE] i.e. silent 
recycling.

> class(NA)
[1] "logical"
> x=101:108
> x[NA]
[1] NA NA NA NA NA NA NA NA
> x[c(TRUE,NA)]
[1] 101  NA 103  NA 105  NA 107  NA

> x[as.integer(NA)]
[1] NA
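
And if an index vector might contain NA but positional semantics are
wanted, coercing it to integer first is one workaround (a quick sketch) :

> idx = c(NA,NA)
> x[as.integer(idx)]
[1] NA NA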

HTH
Matthew

Barry Rowlingson b.rowling...@lancaster.ac.uk wrote in message 
news:d8ad40b51003260509y6b671e53o9f79142d2b52c...@mail.gmail.com...
If you index a vector with a vector that has NA in it, you get NA back:

  x=101:107
  x[c(NA,4,NA)]
 [1]  NA 104  NA
  x[c(4,NA)]
 [1] 104  NA

All well and good. ?[ says, under NAs in indexing:

 When extracting, a numerical, logical or character ‘NA’ index
 picks an unknown element and so returns ‘NA’ in the corresponding
 element of a logical, integer, numeric, complex or character
 result, and ‘NULL’ for a list.  (It returns ‘00’ for a raw
 result.)

But if the indexing vector is all NA, you get back a vector of length
of your original vector rather than of your index vector:

> x[c(NA,NA)]
 [1] NA NA NA NA NA NA NA

Maybe it's just me, but I find this surprising, and I can't see it
documented. Bug or undocumented feature? Apologies if I've missed
something obvious.

Barry

 sessionInfo()
R version 2.11.0 alpha (2010-03-25 r51407)
i686-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_GB.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] translating SQL statements into data.table operations

2010-03-25 Thread Matthew Dowle

Nick,

Good question, but just sent to the wrong place. The posting guide asks you
to contact the package maintainer first, and to post to r-help only if you
don't hear back. I guess one reason for that is that if questions about all
2000+ packages were sent to r-help, then r-help's traffic could go through 
the roof.  Another reason could be that some (i.e. maybe many, maybe few) 
package maintainers don't actually monitor r-help and might miss any 
messages you post here.  I only saw this one thanks to google alerts.

Since I'm writing anyway ... are you using the latest version on r-forge 
which has the very fast grouping? Have you set multi-column keys on both edt 
and cdt and tried edt[cdt,roll=TRUE] syntax ?  We'll help you off list to 
climb the learning curve quickly. We are working on FAQs and a vignette and 
they should be ready soon too.
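
For a flavour of the syntax I mean, untested and with names taken from your
SQL (the 15-75 day window would still need handling on top) :

   library(data.table)
   setkey(edt, SYMBOL, DATE)
   setkey(ctq, SYMBOL, DATE)
   edt[ctq, roll=TRUE]   # for each ctq row, the prevailing edt row on or
                         # before its DATE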

Please do follow up with us (myself and Tom Short cc'd are the main 
developers) off list and one of us will be happy to help further.

Matthew


Nick Switanek nswita...@gmail.com wrote in message 
news:772ec1011003241351v6a3f36efqb0b0787564691...@mail.gmail.com...
 I've recently stumbled across data.table, Matthew Dowle's package. I'm
 impressed by the speed of the package in handling operations with large
 data.frames, but am a bit overwhelmed with the syntax. I'd like to express
 the SQL statement below using data.table operations rather than sqldf 
 (which
 was incredibly slow for a small subset of my financial data) or
 import/export with a DBMS, but I haven't been able to figure out how to do
 it. I would be grateful for your suggestions.

 nick



 My aim is to join events (trades) from two datasets (edt and cdt) 
 where,
 for the same stock, the events in one dataset occur between 15 and 75 days
 before the other, and within the same time window. I can only see how to
 express the WHERE e.SYMBOL = c.SYMBOL part in data.table syntax. I'm 
 also
 at a loss at whether I can express the remainder using data.table's
 %between% operator or not.

 ctqm <- sqldf("SELECT e.*,
 c.DATE 'DATEctrl',
 c.TIME 'TIMEctrl',
 c.PRICE 'PRICEctrl',
 c.SIZE 'SIZEctrl'

 FROM edt e, ctq c

 WHERE e.SYMBOL = c.SYMBOL AND
   julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND
 75 AND
   strftime('%H:%M:%S',c.TIME) BETWEEN
 strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)")

 [[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mosaic

2010-03-24 Thread Matthew Dowle
When you click "search" on the R homepage, type "mosaic" into the box, and
click the button, do the top 3 links seem relevant ?

Your previous 2 requests for help :

26 Feb : Response was SuppDists. Yet that is the first hit returned by the
subject line you posted : "Hartleys table".

22 Feb : Response was shapiro.test. Yet that is in the second hit returned
by the subject line you posted : "normality in split plot design".

Spot the pattern ?


Silvano silv...@uel.br wrote in message 
news:a9322645c4f846a3a6a9daaa8b5a2...@ccepc...
Hi,

I have this data set:

obitoss = c(
5.8,17.4,5.9,17.6,5.8,17.5,4.7,15.8,
3.8,13.4,3.8,13.5,3.7,13.4,3.4,13.6,
4.4,17.3,4.3,17.4,4.2,17.5,4.3,17.0,
4.4,13.6,5.1,14.6,5.7,13.5,3.6,13.3,
6.5,19.6,6.4,19.4,6.3,19.5,6.0,19.7)

(dados = data.frame(
regiao = factor(rep(c('Norte', 'Nordeste', 'Sudeste', 'Sul',
'Centro-Oeste'), each=8)),
ano = factor(rep(c('2000','2001','2002','2003'), each=2)),
sexo = factor(rep(c('F','M'), 4)), resp=obitoss))

I would like to make a mosaic plot to represent the numeric
variable as a function of the 3 factors. Does anyone know how
to do this?

--
Silvano Cesar da Costa
Departamento de Estatística
Universidade Estadual de Londrina
Fone: 3371-4346

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] If else statements

2010-03-23 Thread Matthew Dowle
Here are some references. Please read these first and post again if you are 
still stuck after reading them. If you do post again, we will need x and y.

1. Introduction to R : 9.2.1 Conditional execution: if statements.
2. R Language Definition : 3.2 Control structures.
3. R for beginners by E Paradis : 6.1 Loops and vectorization
4. Eric Raymond's essay How to Ask Questions The Smart Way 
http://www.catb.org/~esr/faqs/smart-questions.html.

HTH
Matthew


tj girlm...@yahoo.com wrote in message 
news:1269325933723-1678705.p...@n4.nabble.com...

 Hi everyone!
 May I ask again for your help?
 I need to write some code using if-else statements...
 Can I do an if-else statement inside an if-else statement? Is this the
 correct form of writing it?
 Thank you.=)

 Example:

 for (v in 1:6) {
 for (i in 2:200) {
 if (v==1)
 (if max(x*v-y*v)>1 break())

 if (v==2)
 (if max(x*v-y*v)>1.8 break())

 if (v==3)
 (if max(x*v-y*v)>2 break())
 }
 }
 -- 
 View this message in context: 
 http://n4.nabble.com/If-else-statements-tp1678705p1678705.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Forecasting with Panel Data

2010-03-11 Thread Matthew Dowle
Ricardo,

I see you got no public answer so far, on either of the two lists you posted 
to at the same time yesterday.  You are therefore unlikely to ever get a 
reply.

I also see you've been having trouble getting answers in the past, back to 
Nov 09, at least.  For example no reply to "Credit Migration Matrix" (Jan
2010) and no reply to "Help with a Loop in function" (Nov 2009).

For your information, this is a public place and it took me about 10 seconds 
to assess you. Anyone else on the planet can do this too.

Please read the posting guide AND the links from it, especially the last 
link.  I suggest you read it fully, and slowly.  I think its just that you 
didn't know about it, or somehow missed it by accident.  You were told to 
read it though, at the time you subscribed to this list, at least.  Don't 
worry,  this is not a huge problem. You can build up your reputation again 
very quickly.

With the kindest of regards,

Matthew


Ricardo Gonçalves Silva ricard...@terra.com.br wrote in message 
news:df406bd9dbe644a9b8c0642a3c3f8...@ricardopc...
 Dear Users,

 Can I perform out-of-sample forecasts from a panel data (fixed effects)
 model using R?

 Thanks in advance,

 Ricardo.
 [[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed

2010-03-10 Thread Matthew Dowle
Your choice of subject line alone shows some people that you missed some 
small details from the posting guide. The ability to notice small details 
may be important for you to demonstrate in future.  Any answer in this 
thread is unlikely to be found by a topic search on subject lines alone 
since speed is a single word.

One fast way to increase your reputation is to contribute.  You now have an 
opportunity.  If you follow Jim's good advice, discover the answer for 
yourself, and post it back to the group, changing the subject line so that 
it's easier for others to find in future, that's one way you can contribute
and increase your reputation.  If you don't do that, that's your choice. It
is entirely up to you. Whatever action you take next, even doing nothing is 
an action, it is visible in public for everyone to search back and find out 
within seconds.

HTH

Adam Majewski adamm...@o2.pl wrote in message 
news:hn6fp4$2g...@dough.gmane.org...
 Hi,

 I have found some example of R code :
 http://commons.wikimedia.org/wiki/File:Mandelbrot_Creation_Animation_%28800x600%29.gif

 When I run this code on my computer it takes few seconds.

 I wanted to make similar program in Maxima CAS :

 http://thread.gmane.org/gmane.comp.mathematics.maxima.general/29949/focus=29968

 for example :

 f(x,y,n) :=
 block([i:0, c:x+y*%i,ER:4,iMax:n,z:0],
 while abs(z)<ER and i<iMax
do (z:z*z + c,i:i+1),
 min(ER,abs(z)))$

 wxanimate_draw3d(
n, 5,
enhanced3d=true,
user_preamble="set pm3d at b; set view map",
xu_grid=70,
yv_grid=70,
explicit('f(x,y,n), x, -2, 0.7, y, -1.2, 1.2))$


 But it takes so long to make even one image (hours).

 What makes the difference, and why R so fast ?

 Regards

 Adam


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Strange result in survey package: svyvar

2010-03-10 Thread Matthew Dowle
This list is the wrong place for that question.  The posting guide tells 
you, in bold, to contact the package maintainer first.

If you had already done that, and didn't hear back from him,  then you 
should tell us,  so that we know you followed the guide.

Corey Sparks corey.spa...@utsa.edu wrote in message 
news:c7bd3ca5.206a%corey.spa...@utsa.edu...
 Hi R users,
 I'm using the survey package to calculate summary statistics for a large
 health survey (the Demographic and Health Survey for Honduras, 2006), and
 when I try to calculate the variances for several variables, I get 
 negative
 numbers.  I thought it may be my data, so I ran the example on the help
 page:

 data(api)
 ## one-stage cluster sample
 dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

 svyvar(~api00+enroll+api.stu+api99, dclus1)
            variance     SE
 api00       11182.8 1386.4
 api00       11516.3 1412.9
 api.stu     -4547.1 3164.9
 api99       12735.2 1450.1

 If I look at the full matrix for the variances (and covariances):
 test <- svyvar(~api00+enroll+api.stu+api99, dclus1)

 print(test, covariance=T)
variance  SE
 api00:api00  11182.8  1386.4
 enroll:api00 -5492.4  3458.1
 api.stu:api00-4547.1  3164.9
 api99:api00  11516.3  1412.9
 api00:enroll -5492.4  3458.1
 enroll:enroll   136424.3 41377.2
 api.stu:enroll  114035.7 34153.9
 api99:enroll -3922.3  3589.9
 api00:api.stu-4547.1  3164.9
 enroll:api.stu  114035.7 34153.9
 api.stu:api.stu  96218.9 28413.7
 api99:api.stu-3060.0  3260.9
 api00:api99  11516.3  1412.9
 enroll:api99 -3922.3  3589.9
 api.stu:api99-3060.0  3260.9
 api99:api99  12735.2  1450.1


 I see that the function is actually returning the covariance for the 
 api.stu
 with the api00 variable.

 I can get the correct variances if I just take
 diag(test)

 But I just was wondering if anyone else was having this problem.  I'm 
 using
 :
 sessionInfo()
 R version 2.10.1 Patched (2009-12-20 r50794)
 x86_64-apple-darwin9.8.0

 locale:
 [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 other attached packages:
 [1] survey_3.19

 loaded via a namespace (and not attached):
 [1] tools_2.10.1

 And have the same error on a linux server.

 Thanks,
 Corey
 -- 
 Corey Sparks
 Assistant Professor
 Department of Demography and Organization Studies
 University of Texas at San Antonio
 501 West Durango Blvd
 Monterey Building 2.270C
 San Antonio, TX 78207
 210-458-3166
 corey.sparks 'at' utsa.edu
 https://rowdyspace.utsa.edu/users/ozd504/www/index.htm


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] IMPORTANT - To remove the null elements from a vector

2010-03-09 Thread Matthew Dowle
Welcome to R Barbara.  It's quite an incredible community from all walks of
life.

Your beginner questions are answered in the manual. See Introduction to R. 
Please read the posting guide again because it contains lots of good advice 
for you. Some people read it three times before posting because they have so 
much respect for the community.  Sometimes they trip up over themselves to 
show they have read it.

Btw - just to let you know that starting your subject lines with IMPORTANT 
is considered by some people a demanding tone for free help. Not everyone, 
but some people. Two posts starting IMPORTANT within 5 minutes is another 
thing that a very large number of people around the world may have just seen 
you do.  I'm just letting you know, in case you were not aware of this.

You received answers from four people who clearly don't mind, and you have 
your answers. Was that your only goal in posting?  Did you consider there 
might be downsides?  This is a public list read by many people and one thing 
the posting guide says is that your questions are saved in the archives 
forever.  Just checking you knew that.  I wouldn't want you to reduce your 
reputation accidentally.  A future employer (it might be a company, or it 
might be a university) anywhere in the world might do a simple search on 
your name, and thats why you might not get an interview, because you had 
showed (in their minds) that you didn't have respect for guidlines. I would 
hate for something like that to happen, all just because you didn't know you 
were supposed to read the posting guide, it wouldn't be fair on you. So it 
would be very unfair of me to know that, and suspect that you don't, but not 
tell you about the posting guide, wouldn't it ?  I hope this information 
helps you.  It is entirely up to you.

r-help is a great way to increase your reputation, but it can reduce your 
reputation too.  By asking great questions, or even contributing, you can 
proudly put that on your CV and increase your chances of getting that 
interview, or getting that position.  I have seen on several CVs from 
students the text please search for my name on r-help.  Just like 
everything you do in public, r-help is very similar. What you write, you 
write in the public domain, and you write it free of charge, and free of 
restriction.

All this applies to all us. When asking for help, and when giving help.

Matthew


barbara.r...@uniroma1.it wrote in message 
news:of1a8063a1.fc14f5ff-onc12576e1.00466053-c12576e1.00466...@uniroma1.it...

 I have a vector that has null elements. How can I remove these elements?
 For example:
 x=[10 0 30 40 0 0] I want the vector y=[10 30 40]
 Thanks
 [[alternative HTML version deleted]]
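
For completeness, a minimal sketch of the subsetting being asked about 
(assuming "null elements" means zeros, as the example suggests) :

x <- c(10, 0, 30, 40, 0, 0)
y <- x[x != 0]
y
## [1] 10 30 40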


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] fit a gamma pdf using Residual Sum-of-Squares

2010-03-08 Thread Matthew Dowle
Thanks for making it quickly reproducible - I was able to see that message 
in English within a few seconds.
The start has x=86, but the data is also called x.  Remove x=86 from start 
and you get a different error.
P.S. - please do include the R version information. It saves time for us, 
and we like it if you save us time.
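
A minimal sketch of the fix described above (dropping x=86 from start; it is 
wrapped in try() because convergence is not guaranteed, and the scale factor 
k is an assumption since y looks like counts rather than a density) :

fit <- try(nls(y ~ k * (1/((s^a)*gamma(a)))*x^(a-1)*exp(-x/s),
               start = list(s = 3, a = 75, k = 4000)))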

vincent laperriere vincent_laperri...@yahoo.fr wrote in message 
news:883644.16455...@web24106.mail.ird.yahoo.com...
Hi all,

I would like to fit a gamma pdf to my data using the method of RSS (Residual 
Sum-of-Squares). Here are the data:

 x <- c(86,  90,  94,  98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138, 
142, 146, 150, 154, 158, 162, 166, 170, 174)
 y <- c(2, 5, 10, 17, 26, 60, 94, 128, 137, 128, 77, 68, 65, 60, 51, 26, 17, 
9, 5, 2, 3, 7, 3)

I have typed the following code, using nls method:

fit <- nls(y ~ (1/((s^a)*gamma(a))*x^(a-1)*exp(-x/s)), start = c(s=3, a=75, 
x=86))

But I get the following error message:


Error in qr(.swts * attr(rhs, "gradient")) :
  dimensions [product 3] do not match the length of object [23]
In addition: Warning message:
In .swts * attr(rhs, "gradient") : longer object length
  is not a multiple of shorter object length

Could anyone help me with the code?
I would greatly appreciate it.
Sincerely yours,
Vincent Laperrière.



[[alternative HTML version deleted]]









__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ifthen() question

2010-03-05 Thread Matthew Dowle
This post breaks the posting guide in multiple ways.  Please read it again 
(and then again) - in particular the first 3 paragraphs.  You will help 
yourself by following it.

The solution is right there in the help page for ?data.frame and other 
places including "An Introduction to R".  I think it's more helpful *not* to tell 
you what it is, so that you discover it for yourself, learn how to learn, 
and Google.   I hope you appreciate that I've been helpful by 
simply (and quickly) telling you the answer *is* there.

Having said that, you don't appear to be aware of many of the packages 
around that do this task - you appear to be re-inventing the wheel.  I 
suggest you briefly investigate each and every one of the top 30 packages 
ranked by crantastic, before writing any more R code.  A little time 
invested doing that will pay you dividends in the long run. That is not a 
criticism of you though, as that advice is not in the posting guide.

Matthew


AC Del Re de...@wisc.edu wrote in message 
news:85cf8f8d1003040735k2b076142jc99b7ec34da87...@mail.gmail.com...
 Hi All,

 I am using a specialized aggregation function to reduce a dataset with
 multiple rows per id down to 1 row per id. My function works perfectly when
 there are > 1 id but alters the 'var.g' in undesirable ways when this
 condition is not met. Therefore, I have been trying ifthen() statements to
 keep the original value when the length of unique id == 1 but I cannot get it 
 to
 work. e.g.:

 #function to aggregate effect sizes:
 aggs <- function(g, n.1, n.2, cor = .50) {
  n.1 <- mean(n.1)
  n.2 <- mean(n.2)
  N_ES <- length(g)
  corr.mat <- matrix(rep(cor, N_ES^2), nrow=N_ES)
  diag(corr.mat) <- 1
  g1g2 <- cbind(g) %*% g
  PSI <- (8*corr.mat + g1g2*corr.mat^2)/(2*(n.1+n.2))
  PSI.inv <- solve(PSI)
  a <- rowSums(PSI.inv)/sum(PSI.inv)
  var.g <- 1/sum(PSI.inv)
  g <- sum(g*a)
  out <- cbind(g, var.g, n.1, n.2)
  return(out)
  }


 # automating this procedure for all rows of df. This format works perfectly
 when there is > 1 id per row only:

 agg_g <- function(id, g, n.1, n.2, cor = .50) {
  st <- unique(id)
  out <- data.frame(id=rep(NA,length(st)))
  for(i in 1:length(st))   {
out$id[i] <- st[i]
out$g[i] <- aggs(g=g[id==st[i]], n.1= n.1[id==st[i]],
   n.2 = n.2[id==st[i]], cor)[1]
out$var.g[i] <- aggs(g=g[id==st[i]], n.1= n.1[id==st[i]],
  n.2 = n.2[id==st[i]], cor)[2]
out$n.1[i] <- round(mean(n.1[id==st[i]]),0)
out$n.2[i] <- round(mean(n.2[id==st[i]]),0)
  }
  return(out)
 }


 # The attempted solution using ifthen() and minor changes to function but
 it's not working properly:
 agg_g <- function(df, var.g, id, g, n.1, n.2, cor = .50) {
  df$var.g <- var.g
  st <- unique(id)
  out <- data.frame(id=rep(NA,length(st)))
  for(i in 1:length(st))   {
out$id[i] <- st[i]
out$g[i] <- aggs(g=g[id==st[i]], n.1= n.1[id==st[i]],
   n.2 = n.2[id==st[i]], cor)[1]
out$var.g[i] <- ifelse(length(st[i])==1, df$var.g[id==st[i]],
 aggs(g=g[id==st[i]],
  n.1= n.1[id==st[i]],
  n.2 = n.2[id==st[i]], cor)[2])
out$n.1[i] <- round(mean(n.1[id==st[i]]),0)
out$n.2[i] <- round(mean(n.2[id==st[i]]),0)
  }
  return(out)
 }

 # sample data:
 id <- c(1, rep(1:19))
 n.1 <- c(10,20,13,22,28,12,12,36,19,12,36,75,33,121,37,14,40,16,14,20)
 n.2 <- c(11,22,10,20,25,12,12,36,19,11,34,75,33,120,37,14,40,16,10,21)
 g <- c(.68,.56,.23,.64,.49,-.04,1.49,1.33,.58,1.18,-.11,1.27,.26,.40,.49,
 .51,.40,.34,.42,1.16)
 var.g <-
 c(.08,.06,.03,.04,.09,.04,.009,.033,.0058,.018,.011,.027,.026,.0040,
 .049,.0051,.040,.034,.0042,.016)
 df <- data.frame(id, n.1,n.2, g, var.g)

 Any help is much appreciated,

 AC

 [[alternative HTML version deleted]]
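
For reference, the ifelse() line above tests length(st[i]), which is always 
1, so the aggregated var.g branch is never reached. A minimal sketch of the 
likely intent (counting the rows for each id instead) :

out$var.g[i] <- if (sum(id == st[i]) == 1) {
  df$var.g[id == st[i]]
} else {
  aggs(g = g[id == st[i]], n.1 = n.1[id == st[i]],
       n.2 = n.2[id == st[i]], cor)[2]
}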


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Nonparametric generalization of ANOVA

2010-03-05 Thread Matthew Dowle
Frank, I respect your views but I agree with Gabor.  The posting guide does 
not support your views.

It is not any of our views that are important; what we are following is the 
posting guide.  It covers affiliation. It says only that "some consider it 
good manners" to include a concise signature specifying affiliation. It 
does not agree that it is bad manners not to.  It is therefore going too far 
to urge "R-gurus", whoever they might be, to ignore such postings on that 
basis alone.  It is up to responders (I think that is the better word, and 
the one used by the posting guide) whether they reply.  Missing 
affiliation is ok by the posting guide.  Users shouldn't be put off from 
posting because of that alone.

Sending from an anonymous email address such as "BioStudent" is also fine by 
the posting guide as far as my eyes read it. It says only that the email 
address should work. I would also answer such anonymous posts, providing 
they demonstrate they made best efforts to follow the posting guide, as 
usual for all requests for help.  It's so easy to send from a false but 
apparently real name, so why worry about that?

If you disagree with the posting guide then you could make a suggestion to 
get the posting guide changed with respect to these points.  But, currently, 
good practice is defined by the posting guide, and I can't see that your 
view is backed up by it.  In fact it seems to me that these points were 
carefully considered, and the wording is careful on these points.

As far as I know you are wrong that there is no moderator.  There are in 
fact an uncountable number of people who are empowered to moderate, i.e. all 
of us. In other words it's up to the responders to moderate.  The posting 
guide is our guide.  As a last resort we can alert the list administrator 
(which I believe is the correct name for him in that role), who has powers 
to remove an email address from the list if he thinks that is appropriate, 
or act otherwise, or not at all.  It is actually up to responders (i.e. all 
of us) to ensure the posting guide is followed.

My view is that the problems started with some responders on some occasions. 
They sometimes forgot, a little bit, to encourage and remind posters to 
follow the posting guide when it was not followed. This then may have 
encouraged more posters to think it was ok not to follow the posting guide. 
That is my own personal view,  not a statistical one backed up by any 
evidence.

Matthew


Frank E Harrell Jr f.harr...@vanderbilt.edu wrote in message 
news:4b913880.9020...@vanderbilt.edu...
 Gabor Grothendieck wrote:
 I am happy to answer posts to r-help regardless of the name and email
 address of the poster but would draw the line at someone excessively
 posting without a reasonable effort to find the answer first or using
 it for homework since such requests could flood the list making it
 useless for everyone.

 Gabor I respectfully disagree.  It is bad practice to allow anonymous 
 postings.  We need to see real names and real affiliations.

 r-help is starting to border on uselessness because of the age old problem 
 of the same question being asked every two days, a high frequency of 
 specialty questions, and answers given with the best of intentions in 
 incremental or contradictory e-mail pieces (as opposed to a cumulative 
 wiki or hierarchically designed discussion web forum), as there is no 
 moderator for the list.  We don't need even more traffic from anonymous 
 postings.

 Frank


 On Fri, Mar 5, 2010 at 10:55 AM, Ravi Varadhan rvarad...@jhmi.edu 
 wrote:
 David,

 I agree with your sentiments.  I also think that it is bad posting 
 etiquette not to sign one's genuine name and affiliation when asking for 
 help, which "blue sky" seems to do a lot.  Bert Gunter has already 
 raised this issue, and I completely agree with him. I would also like to 
 urge the R-gurus to ignore such postings.

 Best,
 Ravi.
 

 Ravi Varadhan, Ph.D.
 Assistant Professor,
 Division of Geriatric Medicine and Gerontology
 School of Medicine
 Johns Hopkins University

 Ph. (410) 502-2619
 email: rvarad...@jhmi.edu


 - Original Message -
 From: David Winsemius dwinsem...@comcast.net
 Date: Friday, March 5, 2010 9:25 am
 Subject: Re: [R] Nonparametric generalization of ANOVA
 To: blue sky bluesky...@gmail.com
 Cc: r-h...@stat.math.ethz.ch


  On Mar 5, 2010, at 8:19 AM, blue sky wrote:

   My interpretation of the relation between 1-way ANOVA and Wilcoxon's
   test (wilcox.test() in R) is the following.
  
   1-way ANOVA is to test if two or multiple distributions are the 
 same,
   assuming all the distributions are normal and have equal variances.
   Wilcoxon's test is to test two distributions are the same without
   assuming what their distributions are.
  
   In this sense, I'm wondering what is the generalization of 
 Wilcoxon's
   test to more than two distributions. And, more general, what 

Re: [R] Nonparametric generalization of ANOVA

2010-03-05 Thread Matthew Dowle
John,

So you want BlueSky to change their name to "Paul Smith at New York 
University", just to give a totally random, false-name example, and then 
you will be happy?  I just picked a popular, real name at a real, big 
place.   Are you, or is anyone else, going to check it's real?

We want BlueSky to ask great questions,  which haven't been asked before, 
and to follow the posting guide.  If BlueSky improves the knowledge base 
what's the problem?  This person may well be breaking the posting guide for 
many other reasons  (I haven't looked), and if they are then you could take 
issue with them on those points, but not for simply writing as BlueSky.

David W has got it right when he replied to "ManInMoon".   Shall we stop 
this thread now, and follow his lead?   I would have picked "ManOnMoon" 
myself but maybe that one was taken. It's rather difficult to be on a moon, 
let alone inside it.

Matthew


John Sorkin jsor...@grecc.umaryland.edu wrote in message 
news:4b91068702cb00064...@medicine.umaryland.edu...
 The sad part of this interchange is that Blue Sky does not seem to be 
 amenable to suggestion. He, or she, has not taken note of, or responded to, the 
 fact that a number of people believe it is good manners to give a real 
 name and affiliation. My mother taught me that when two people tell you 
 that you are drunk you should lie down until the inebriation goes away. 
 Blue Sky, several people have noted that you would do well to give us your 
 name and affiliation. Is this too much to ask given that people are good 
 enough to help you?
 John




 John David Sorkin M.D., Ph.D.
 Chief, Biostatistics and Informatics
 University of Maryland School of Medicine Division of Gerontology
 Baltimore VA Medical Center
 10 North Greene Street
 GRECC (BT/18/GR)
 Baltimore, MD 21201-1524
 (Phone) 410-605-7119
 (Fax) 410-605-7913 (Please call phone number above prior to faxing) 
 Matthew Dowle mdo...@mdowle.plus.com 3/5/2010 12:58 PM 
 Frank, I respect your views but I agree with Gabor.  The posting guide 
 does
 not support your views.

 It is not any of our views that are important but we are following the
 posting guide.  It covers affiliation. It says only that "some consider 
 it
 good manners" to include a concise signature specifying affiliation. It
 does not agree that it is bad manners not to.  It is therefore going too 
 far
 to urge R-gurus, whoever they might be, to ignore such postings on that
 basis alone.  It is up to responders (I think that is the better word 
 which
 is the one used by the posting guide) whether they reply.  Missing
 affiliation is ok by the posting guide.  Users shouldn't be put off from
 posting because of that alone.

 Sending from an anonymous email address such as BioStudent is also fine 
 by
 the posting guide as far as my eyes read it. It says only that the email
 address should work. I would also answer such anonymous posts, providing
 they demonstrate they made best efforts to follow the posting guide, as
 usual for all requests for help.  It's so easy to send from a false but
 apparently real name, so why worry about that?

 If you disagree with the posting guide then you could make a suggestion to
 get the posting guide changed with respect to these points.  But, 
 currently,
 good practice is defined by the posting guide, and I can't see that 
 your
 view is backed up by it.  In fact it seems to me that these points were
 carefully considered, and the wording is careful on these points.

 As far as I know you are wrong that there is no moderator.  There are in
 fact an uncountable number of people who are empowered to moderate, i.e. 
 all
 of us. In other words it's up to the responders to moderate.  The posting
 guide is our guide.  As a last resort we can alert the list administrator
 (which I believe is the correct name for him in that role), who has powers
 to remove an email address from the list if he thinks that is appropriate,
 or act otherwise, or not at all.  It is actually up to responders (i.e. 
 all
 of us) to ensure the posting guide is followed.

 My view is that the problems started with some responders on some 
 occasions.
 They sometimes forgot, a little bit, to encourage and remind posters to
 follow the posting guide when it was not followed. This then may have
 encouraged more posters to think it was ok not to follow the posting 
 guide.
 That is my own personal view,  not a statistical one backed up by any
 evidence.

 Matthew


 Frank E Harrell Jr f.harr...@vanderbilt.edu wrote in message
 news:4b913880.9020...@vanderbilt.edu...
 Gabor Grothendieck wrote:
 I am happy to answer posts to r-help regardless of the name and email
 address of the poster but would draw the line at someone excessively
 posting without a reasonable effort to find the answer first or using
 it for homework since such requests could flood the list making it
 useless for everyone.

 Gabor I respectfully disagree.  It is bad practice to allow anonymous
 postings.  We

Re: [R] data.table evaluating columns

2010-03-03 Thread Matthew Dowle

I'd go a bit further and remind that the r-help posting guide is clear :

   "For questions about functions in standard packages distributed with R 
(see the FAQ "Add-on packages in R"), ask questions on R-help.
If the question relates to a contributed package, e.g., one downloaded from 
CRAN, try contacting the package maintainer first. You can also use 
find("functionname") and packageDescription("packagename") to find this 
information. ONLY send such questions to R-help or R-devel if you get no 
reply or need further assistance. This applies to both requests for help and 
to bug reports."

The ONLY is in bold in the posting guide. I changed the bold to capitals 
above for people reading this in text only.

Since Tom and I are friendly and responsive, users of data.table don't 
usually make it to r-help. We'll follow up this one off-list.  Please note 
that Rob's question is very good by the rest of the posting guide, so no 
complaints there, only that it was sent to the wrong place.  Please keep the 
questions coming, but send them to us, not r-help.

You do sometimes see messages to r-help starting something like "I have 
contacted the authors/maintainers but didn't hear back, does anyone know 
...".  To not state that they had would be an implicit request for further 
work by the community (for free) to ask if they had. So its not enough to 
contact the maintainer first, but you also have to say that you have as 
well, and perhaps how long ago too would be helpful.  For r-forge projects I 
usually send any question to everyone on the project (easy to find) or if 
they have a list then to that.

HTH
Matthew


Tom Short tshort.rli...@gmail.com wrote in message 
news:fd27013a1003021718w409acb32r1281dfeca5593...@mail.gmail.com...
On Tue, Mar 2, 2010 at 7:09 PM, Rob Forler rfor...@uchicago.edu wrote:
 Hi everyone,

 I have the following code that works in data frames taht I would like tow
 ork in data.tables . However, I'm not really sure how to go about it.

 I basically have the following

 names = c("data1", "data2")
 frame = data.frame(list(key1=as.integer(c(1,2,3,4,5,6)),
 key2=as.integer(c(1,2,3,2,5,6)), data1 = c(3,3,2,3,5,2), data2=
 c(3,3,2,3,5,2)))

 for(i in 1:length(names)){
 frame[, paste(names[i], "flag")] = frame[, names[i]] > 3

 }

 Now I try with data.table code:
 names = c("data1", "data2")
 frame = data.table(list(key1=as.integer(c(1,2,3,4,5,6)),
 key2=as.integer(c(1,2,3,2,5,6)), data1 = c(3,3,2,3,5,2), data2=
 c(3,3,2,3,5,2)))

 for(i in 1:length(names)){
 frame[, paste(names[i], "flag"), with=F] = as.matrix(frame[, names[i],
 with=F]) > 3

 }

Rob, this type of question is better for the package maintainer(s)
directly rather than R-help. That said, one answer is to use list
addressing:

for(i in 1:length(names)){
frame[[paste(names[i], "flag")]] = frame[[names[i]]] > 3
}

Another option is to manipulate frame as a data frame and convert to
data.table when you need that functionality (conversion is quick).

In the data table version, frame[,names[i], with=F] is the same as
frame[,names[i], drop=FALSE] (the answer is a list, not a vector).
Normally, it's easier to use [[]] or $ indexing to get this. Also,
fname[i,j] <- something assignment is still a bit buggy for
data.tables.

- Tom

Tom Short

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data.table evaluating columns

2010-03-03 Thread Matthew Dowle
That in itself is a question for the maintainer, off r-help. When the 
posting guide says "contact the package maintainer first" it means that 
literally, and it applies even to questions about the existence of a mailing 
list for the package.  So what I'm supposed to do now is tell you how the 
posting guide works, and tell you that I'll reply off list.  Then hopefully 
the community will be happy with me too.  So I'll reply off list :-)

Rob Forler rfor...@uchicago.edu wrote in message 
news:eb472fec1003030502s4996511ap8dfd329a3...@mail.gmail.com...
 Okay I appreciate the help, and I appreciate the FAQ reminder. I will read
 the r-help posting guide. I'm relatively new to using the support systems
 around R. So far everyone has been really helpful.

 I'm confused as to which data.table list I should be using.
 http://lists.r-forge.r-project.org/pipermail/datatable-commits/ doesn't
 appear to be correct. Or just directly sending an email to all of you?

 Thanks again,
 Rob



 On Wed, Mar 3, 2010 at 6:05 AM, Matthew Dowle 
 mdo...@mdowle.plus.comwrote:


 I'd go a bit further and remind that the r-help posting guide is clear :

   "For questions about functions in standard packages distributed with R
 (see the FAQ "Add-on packages in R"), ask questions on R-help.
 If the question relates to a contributed package, e.g., one downloaded
 from
 CRAN, try contacting the package maintainer first. You can also use
 find("functionname") and packageDescription("packagename") to find this
 information. ONLY send such questions to R-help or R-devel if you get no
 reply or need further assistance. This applies to both requests for help
 and
 to bug reports."

 The ONLY is in bold in the posting guide. I changed the bold to 
 capitals
 above for people reading this in text only.

 Since Tom and I are friendly and responsive, users of data.table don't
 usually make it to r-help. We'll follow up this one off-list.  Please 
 note
 that Rob's question is very good by the rest of the posting guide, so no
 complaints there, only that it was sent to the wrong place.  Please keep
 the
 questions coming, but send them to us, not r-help.

 You do sometimes see messages to r-help starting something like "I have
 contacted the authors/maintainers but didn't hear back, does anyone know
 ...".  To not state that they had would be an implicit request for 
 further
 work by the community (for free) to ask if they had. So its not enough to
 contact the maintainer first, but you also have to say that you have as
 well, and perhaps how long ago too would be helpful.  For r-forge 
 projects
 I
 usually send any question to everyone on the project (easy to find) or if
 they have a list then to that.

 HTH
 Matthew


 Tom Short tshort.rli...@gmail.com wrote in message
 news:fd27013a1003021718w409acb32r1281dfeca5593...@mail.gmail.com...
 On Tue, Mar 2, 2010 at 7:09 PM, Rob Forler rfor...@uchicago.edu wrote:
  Hi everyone,
 
  I have the following code that works in data frames taht I would like 
  tow
  ork in data.tables . However, I'm not really sure how to go about it.
 
  I basically have the following
 
  names = c("data1", "data2")
  frame = data.frame(list(key1=as.integer(c(1,2,3,4,5,6)),
  key2=as.integer(c(1,2,3,2,5,6)), data1 = c(3,3,2,3,5,2), data2=
  c(3,3,2,3,5,2)))
 
  for(i in 1:length(names)){
  frame[, paste(names[i], "flag")] = frame[, names[i]] > 3
 
  }
 
  Now I try with data.table code:
  names = c("data1", "data2")
  frame = data.table(list(key1=as.integer(c(1,2,3,4,5,6)),
  key2=as.integer(c(1,2,3,2,5,6)), data1 = c(3,3,2,3,5,2), data2=
  c(3,3,2,3,5,2)))
 
  for(i in 1:length(names)){
  frame[, paste(names[i], "flag"), with=F] = as.matrix(frame[, names[i],
  with=F]) > 3
 
  }

 Rob, this type of question is better for the package maintainer(s)
 directly rather than R-help. That said, one answer is to use list
 addressing:

 for(i in 1:length(names)){
frame[[paste(names[i], "flag")]] = frame[[names[i]]] > 3
 }

 Another option is to manipulate frame as a data frame and convert to
 data.table when you need that functionality (conversion is quick).

 In the data table version, frame[,names[i], with=F] is the same as
 frame[,names[i], drop=FALSE] (the answer is a list, not a vector).
 Normally, it's easier to use [[]] or $ indexing to get this. Also,
  fname[i,j] <- something assignment is still a bit buggy for
 data.tables.

 - Tom

 Tom Short

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 [[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Three most useful R package

2010-03-03 Thread Matthew Dowle
Dieter,

One way to check if a package is active is by looking on r-forge. If you 
are referring to data.table you would have found it is actually very active 
at the moment and is far from abandoned.

What you may be referring to is a warning, not an error, with v1.2 on 
R2.10+.  That was fixed many moons ago. The r-forge version is where it's at.

Rather than commenting in public about a warning on a package, and making a 
conclusion about its abandonment, and doing this without copying the 
maintainer, perhaps you could have contacted the maintainer to let him know 
you had found a problem.  That would have been a more community spirited 
action to take.  Doing that at the time you found out would have been 
helpful too rather than saving it up for now.  Or you can always check the 
svn logs yourself,  as the r-forge guys even made that trivial to do.

All,

Can we please now stop this thread?  The crantastic people worked hard to 
provide a better solution.  If the community refuses to use crantastic, 
that's up to the community, but why start now filling up r-help with votes on 
packages when so much effort was put into a much, much better solution ages 
ago?  It's as quick to put your votes into crantastic as it is to write to 
r-help.  What's your problem, folks, with crantastic?   The second reply 
mentioned crantastic but you all chose to ignore it, it seems.  If you want 
to vote, use crantastic.  If you don't want to vote, don't vote.  But using 
r-help to vote?!  The better solution is right there: 
http://crantastic.org/

Matthew


Dieter Menne dieter.me...@menne-biomed.de wrote in message 
news:1267626882999-1576618.p...@n4.nabble.com...


 Rob Forler wrote:

 And data.table because it does aggregation about 50x times faster than
 plyr
 (which I used to use a lot).



 This is correct; from the error message it spits out one has to conclude
 that it was abandoned at R version 2.4.x

 Dieter




 -- 
 View this message in context: 
 http://n4.nabble.com/Three-most-useful-R-package-tp1575671p1576618.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading large files

2010-02-05 Thread Matthew Dowle
I agree with Jim.  The term "do analysis" is almost meaningless; the posting 
guide makes reference to statements such as that. At least he tried to 
define "large", but inconsistently (first of all 850MB, then changed to 
10-20-15GB).

 Satish wrote: "at one time I will need to load say 15GB into R"

Assuming the user is always right then, here is some information :

R has been 64bit on unix for a very long time (over a decade).  64bit R is 
also available for Win64.
It uses as much RAM as you install on the box, e.g. 64GB.
Yes R users do that, and they've been doing that for years and years.
The data.table package was mainly designed for 64bit, although it's a point 
of consternation when people think that's all it's useful for.
If you don't have the hardware, then you can rent the time on EC2. There are 
tools and packages to make that easy e.g. pre-built images you can just use. 
Look at the HPC task view. Search the archives. Don't miss Biocep at 
http://biocep-distrib.r-forge.r-project.org/doc.html.

Albert Einstein said "A clever person solves a problem. A wise person avoids 
it."   So an option for you is to be wise and move to 64bit.


jim holtman jholt...@gmail.com wrote in message 
news:644e1f321002050513y242304der84b5674930b54...@mail.gmail.com...
 Where should we shine it?  No information provided on operating
 system, version, memory, size of files, what you want to do with them,
 etc.  Lot of options: put it in a database, read partial file (lines
 and/or columns), preprocess, etc.  Your option.
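
As an illustration of the "read partial file" option, a minimal sketch of 
chunked reading (the file name and chunk size are placeholders) :

con <- file("big.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",")[[1]]
repeat {
  chunk <- try(read.csv(con, header = FALSE, nrows = 1e5,
                        col.names = hdr), silent = TRUE)
  if (inherits(chunk, "try-error")) break  # no lines left
  ## ... process or aggregate the chunk here ...
  if (nrow(chunk) < 1e5) break
}
close(con)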

 On Fri, Feb 5, 2010 at 8:03 AM, Satish Vadlamani
 satish.vadlam...@fritolay.com wrote:

 Folks:
 Can anyone throw some light on this? Thanks.
 Satish


 -
 Satish Vadlamani
 --
 View this message in context: 
 http://n4.nabble.com/Reading-large-files-tp1469691p1470169.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




 -- 
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading large files

2010-02-05 Thread Matthew Dowle
I can't help you further than what's already been posted to you. Maybe 
someone else can.
Best of luck.

Satish Vadlamani satish.vadlam...@fritolay.com wrote in message 
news:1265397089104-1470667.p...@n4.nabble.com...

 Matthew:
 If it is going to help, here is the explanation. I have an end state in
 mind. It is given below under the "End State" header. In order to get there, I
 need to start somewhere, right? I started with an 850 MB file and could not
 load it in what I think is reasonable time (I waited for an hour).

 There are references to 64 bit. How will that help? It is a 4GB RAM 
 machine
 and there is no paging activity when loading the 850 MB file.

 I have seen other threads on the same types of questions. I did not see 
 any
 clear cut answers or errors that I could have been making in the process. 
 If
 I am missing something, please let me know. Thanks.
 Satish


 End State
 Satish wrote: "at one time I will need to load say 15GB into R"


 -
 Satish Vadlamani
 -- 
 View this message in context: 
 http://n4.nabble.com/Reading-large-files-tp1469691p1470667.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] merging columns

2010-02-03 Thread Matthew Dowle
Yes.
data.df[,wcol,drop=FALSE]
For an explanation of drop see ?[.data.frame
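
A quick illustration of the difference, on toy data :

df <- data.frame(a = 1:3, b = 4:6)
df[, "a"]                 # drops to a vector
df[, "a", drop = FALSE]   # stays a one-column data.frame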

Chuck White chuckwhi...@charter.net wrote in message 
news:20100202212800.o8xbu.681696.r...@mp11...
 Additional clarification: the problem only comes when you have one column 
 selected from the original dataframe.  You need to make the following 
 modification to the original example:

 data.df <- data.frame(aa=c(1,1,0), cc=c(1,0,0), aab=c(0,1,0), 
 aac=c(0,0,1), bb=c(1,0,1))

 And, the following seems to work:
 data.frame(sapply(col2.uniq, function(col) {
  wcol <- which(col==col2)
  as.numeric(rowSums(data.frame(data.df[,wcol])) > 0)
 }))
 I had to wrap data.df[,wcol] in another data.frame to handle situations 
 where wcol had one element. Is there a better approach?


  Chuck White chuckwhi...@charter.net wrote:
 Hello -- I am trying to merge columns in a dataframe based on substring 
 matches in colnames. I would appreciate if somebody can suggest a 
 faster/cleaner approach (eg. I would have really liked to avoid the 
 if-else piece but rowSums does not like that). Thanks.

 data.df <- data.frame(aa=c(1,1,0), bbcc=c(1,0,0), aab=c(0,1,0), 
 aac=c(0,0,1), bbk=c(1,0,1))
 col2 <- substr(colnames(data.df),1,2)

 col2.uniq <- unique(col2)
 names(col2.uniq) <- col2.uniq

 data.frame(sapply(col2.uniq, function(col) {
   wcol <- which(col==col2)
   if(length(wcol) > 1) {
 tmp <- rowSums(data.df[,wcol])
   } else {
 tmp <- data.df[,wcol]
   }
   as.numeric(tmp > 0)
 }))


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Matthew Dowle
How it represents data internally is very important,  depending on the real 
goal :
http://en.wikipedia.org/wiki/Column-oriented_DBMS


Gabor Grothendieck ggrothendi...@gmail.com wrote in message 
news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
How it represents data internally should not be important as long as
you can do what you want.  SQL is declarative so you just specify what
you want rather than how to get it and invisibly to the user it
automatically draws up a query plan and then uses that plan to get the
result.

On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle mdo...@mdowle.plus.com 
wrote:

 sqldf("select * from BOD order by Time desc limit 3")
 Exactly. SQL requires use of "order by". It knows the order, but it isn't
 ordered. That's not good, but might be fine, depending on what the real 
 goal
 is.


 Gabor Grothendieck ggrothendi...@gmail.com wrote in message
 news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
 On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle mdo...@mdowle.plus.com
 wrote:
 How many columns, and of what type are the columns ? As Olga asked too, 
 it
 would be useful to know more about what you're really trying to do.

 3.5m rows is not actually that many rows, even for 32bit R. It depends 
 on
 the columns and what you want to do with those columns.

 At the risk of suggesting something before we know the full facts, one
 possibility is to load the data from flat file into data.table. Use
 setkey()
 to set your keys. Use tables() to summarise your various tables. Then do
 your joins etc all-in-R. data.table has fast ways to do those sorts of
 joins (but we need more info about your task).
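
A minimal sketch of the keyed-join pattern being described (the table 
contents here are made up; only the setkey/join idiom is the point) :

library(data.table)
contact <- data.table(name = c("brown", "jones", "smith"), tel = 1:3)
an      <- data.table(name = c("jones", "smith", "smith"), an_id = 101:103)
setkey(contact, name)   # sort by the join column; joins become binary searches
setkey(an, name)
contact[an]             # rows of contact matching each row of an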

 Alternatively, you could check out the sqldf website. There is an
 sqlread.csv (or similar name) which can read your files directly into SQL

 read.csv.sql

 instead of going via R. Gabor has some nice examples there about that and
 its faster.

 You use some buzzwords which makes me think that SQL may not be
 appropriate
 for your task though. Can't say for sure (because we don't have enough
 information) but it's possible you are struggling because SQL has no row
 ordering concept built in. That might be why you've created an increment

 In the SQLite database it automatically assigns a self incrementing
 hidden column called rowid to each row. e.g. using SQLite via the
 sqldf package on CRAN and the BOD data frame which is built into R we
 can display the rowid column explicitly by referring to it in our
 select statement:

 library(sqldf)
 BOD
   Time demand
 1    1    8.3
 2    2   10.3
 3    3   19.0
 4    4   16.0
 5    5   15.6
 6    7   19.8
 sqldf("select rowid, * from BOD")
   rowid Time demand
 1     1    1    8.3
 2     2    2   10.3
 3     3    3   19.0
 4     4    4   16.0
 5     5    5   15.6
 6     6    7   19.8


 field? Do your queries include "order by <incrementing field>"? SQL is not
 good at "first" and "last" type logic. An all-in-R solution may well be

 In SQLite you can get the top 3 values, say, like this (continuing the
 prior example):

 sqldf("select * from BOD order by Time desc limit 3")
   Time demand
 1    7   19.8
 2    5   15.6
 3    4   16.0

 better, since R is very good with ordered vectors. A 1GB data.table (or
 data.frame) for example, at 3.5m rows, could have 76 integer columns, or
 38 double columns. 1GB is well within 32bit and allows some space for
 working copies, depending on what you want to do with the data. If you
 have
 38 or less columns, or you have 64bit, then an all-in-R solution *might*
 get your task done quicker, depending on what your real goal is.

 If this sounds plausible, you could post more details and, if its
 appropriate, and luck is on your side, someone might even sketch out how
 to
 do an all-in-R solution.


 Nathan S. Watson-Haigh nathan.watson-ha...@csiro.au wrote in message
 news:4b5fde1b.10...@csiro.au...
I have a table (contact) with several fields and its PK is an auto
increment field. I'm bulk loading data to this table from files which, if
successful will be about 3.5million rows (approx 16000 rows per file).
However, I have a linking table (an_contact) to resolve a m:m 
relationship
between the an and contact tables. How can I retrieve the PK's for the
data
bulk loaded into contact so I can insert the relevant data into
an_contact.

 I currently load the data into contact using:
 dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)

 But I then need to get all the PK's which this dbWriteTable() appended 
 to
 the contact table so I can load the data into my an_contact link table. 
 I
 don't want to issue a separate INSERT query for each row in dat and then
 use MySQL's LAST_INSERT_ID() function... not when I have 3.5million rows
 to
 insert!

 Any pointers welcome,
 Nathan

 --
 
 Dr. Nathan S. Watson-Haigh
 OCE Post Doctoral Fellow
 CSIRO Livestock Industries
 University Drive
 Townsville, QLD 4810
 Australia

 Tel: +61 (0)7 4753 8548
 Fax: +61 (0)7 4753 8600
 Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Matthew Dowle
Are you claiming that SQL is that utopia?  SQL is a row store.  It cannot 
give the user the benefits of a column store.

For example, why does SQL take 113 seconds in the example in this thread:
http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
but data.table takes 5 seconds to get the same result? How come the 
high-level language SQL doesn't appear to hide the user from this detail?

If you are just describing utopia, then of course I agree.  It would be 
great to have a language which hid us from this.  In the meantime the user 
has choices, and the best choice depends on the task and the real goal.

Gabor Grothendieck ggrothendi...@gmail.com wrote in message 
news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com...
It's only important internally.  Externally it's undesirable that the
user has to get involved in it.  The idea of making software easy to
write and use is to hide the implementation and focus on the problem.
That is why we use high level languages, object orientation, etc.

On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle mdo...@mdowle.plus.com 
wrote:
 How it represents data internally is very important, depending on the real
 goal :
 http://en.wikipedia.org/wiki/Column-oriented_DBMS


 Gabor Grothendieck ggrothendi...@gmail.com wrote in message
 news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
 How it represents data internally should not be important as long as
 you can do what you want. SQL is declarative so you just specify what
 you want rather than how to get it and invisibly to the user it
 automatically draws up a query plan and then uses that plan to get the
 result.

 On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle mdo...@mdowle.plus.com
 wrote:

 sqldf("select * from BOD order by Time desc limit 3")
 Exactly. SQL requires use of "order by". It knows the order, but it isn't
 ordered. That's not good, but might be fine, depending on what the real
 goal
 is.


 Gabor Grothendieck ggrothendi...@gmail.com wrote in message
 news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
 On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle mdo...@mdowle.plus.com
 wrote:
 How many columns, and of what type are the columns ? As Olga asked too,
 it
 would be useful to know more about what you're really trying to do.

 3.5m rows is not actually that many rows, even for 32bit R. It depends
 on
 the columns and what you want to do with those columns.

 At the risk of suggesting something before we know the full facts, one
 possibility is to load the data from flat file into data.table. Use
 setkey()
 to set your keys. Use tables() to summarise your various tables. Then do
 your joins etc all-in-R. data.table has fast ways to do those sorts of
 joins (but we need more info about your task).

 Alternatively, you could check out the sqldf website. There is an
 sqlread.csv (or similar name) which can read your files directly into 
 SQL

 read.csv.sql

 instead of going via R. Gabor has some nice examples there about that 
 and
 its faster.

 You use some buzzwords which makes me think that SQL may not be
 appropriate
 for your task though. Can't say for sure (because we don't have enough
 information) but it's possible you are struggling because SQL has no row
 ordering concept built in. That might be why you've created an increment

 In the SQLite database it automatically assigns a self incrementing
 hidden column called rowid to each row. e.g. using SQLite via the
 sqldf package on CRAN and the BOD data frame which is built into R we
 can display the rowid column explicitly by referring to it in our
 select statement:

 library(sqldf)
 BOD
   Time demand
 1    1    8.3
 2    2   10.3
 3    3   19.0
 4    4   16.0
 5    5   15.6
 6    7   19.8
 sqldf("select rowid, * from BOD")
   rowid Time demand
 1     1    1    8.3
 2     2    2   10.3
 3     3    3   19.0
 4     4    4   16.0
 5     5    5   15.6
 6     6    7   19.8


 field? Do your queries include "order by <incrementing field>"? SQL is not
 good at "first" and "last" type logic. An all-in-R solution may well be

 In SQLite you can get the top 3 values, say, like this (continuing the
 prior example):

 sqldf("select * from BOD order by Time desc limit 3")
   Time demand
 1    7   19.8
 2    5   15.6
 3    4   16.0

 better, since R is very good with ordered vectors. A 1GB data.table (or
 data.frame) for example, at 3.5m rows, could have 76 integer columns, or
 38 double columns. 1GB is well within 32bit and allows some space for
 working copies, depending on what you want to do with the data. If you
 have
 38 or less columns, or you have 64bit, then an all-in-R solution *might*
 get your task done quicker, depending on what your real goal is.

 If this sounds plausible, you could post more details and, if its
 appropriate, and luck is on your side, someone might even sketch out how
 to
 do an all-in-R solution.


 Nathan S. Watson-Haigh nathan.watson-ha...@csiro.au wrote in message
 news:4b5fde1b.10...@csiro.au...
I have a table (contact) with several fields and it's PK is an auto
increment field. I'm bulk loading data

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-28 Thread Matthew Dowle
I'm talking about ease of use too.  The first line of the Details section in 
?[.data.table says :
   "Builds on base R functionality to reduce 2 types of time :
   1. programming time (easier to write, read, debug and maintain)
   2. compute time"

Once again, I am merely saying that the user has choices, and the best 
choice (and there are many choices including plyr, and lots of other great 
packages and base methods) depends on the task and the real goal.   This 
choice is not restricted to compute time only, as you seem to suggest.  In 
fact I listed programming time first (i.e. ease of use).

To answer your points :

This is the SQL code you posted and I used in the comparison. Notice it's 
quite long, repeats the text var1,var2,var3 4 times, and contains two 
'select's and a 'using'.
 system.time(sqldf("select var1, var2, var3, dt from a, (select var1, var2, 
 var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2, 
 var3) where dt - mindt < 7"))
    user  system elapsed
 103.13    2.17  106.23

Isolating the series of operations you described :
 system.time(sqldf("select * from a"))
    user  system elapsed
   39.00    0.63   39.62

So that's roughly 40% of the time. What's happening in the remaining 66 secs?

Here's a repeat of the equivalent in data.table :

 system.time({adt <- data.table(a)})
    user  system elapsed
    0.90    0.13    1.03
 system.time(adt[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"]) 
 #  is that so hard to use compared to the SQL above ?
    user  system elapsed
    3.92    0.78    4.71
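
For anyone re-running these timings, a rough stand-in for the data frame a 
(the sizes are an assumption; the real definition is in the linked thread) :

set.seed(1)
n <- 1e6
a <- data.frame(var1 = sample(1e4, n, TRUE), var2 = sample(10, n, TRUE),
                var3 = sample(10, n, TRUE),
                dt   = as.numeric(sample(1000, n, TRUE)))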

I looked at the news section, but I didn't find the benchmarks quickly or 
easily.  The links I saw took me to the FAQs.



Gabor Grothendieck ggrothendi...@gmail.com wrote in message 
news:971536df1001280855i1d5f7c03v46f7a3e58ff93...@mail.gmail.com...
I think one would only be concerned about such internals if one were
primarily interested in performance; otherwise, one would be more
interested in ease of specification and part of that ease is having it
independent of implementation and separating implementation from
specification activities.  An example of separation of specification
and implementation is that by simply specifying a disk-based database
rather than an in-memory database SQL can perform queries that take
more space than memory.  The query itself need not be modified.

I think the viewpoint you are discussing is primarily one of
performance whereas the viewpoint I was discussing is primarily ease
of use and that accounts for the difference.

I believe your performance comparison is comparing a sequence of
operations that include building a database, transferring data to it,
performing the operation, reading it back in and destroying the
database to an internal manipulation.  I would expect the internal
manipulation, particular one done primarily in C code as is the case
with data.table, to be faster although some benchmarks of the database
approach found that it compared surprisingly well to straight R code
-- some users of sqldf found that for an 8000 row data frame sqldf
actually ran faster than aggregate and also faster than tapply.  The
News section on the sqldf home page provides links to their
benchmarks.  Thus if R is fast enough then its likely that the
database approach is fast enough too since its even faster.

On Thu, Jan 28, 2010 at 8:52 AM, Matthew Dowle mdo...@mdowle.plus.com 
wrote:
 Are you claiming that SQL is that utopia? SQL is a row store. It cannot
 give the user the benefits of a column store.

 For example, why does SQL take 113 seconds in the example in this thread :
 http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
 but data.table takes 5 seconds to get the same result ? How come the high
 level language SQL doesn't appear to hide the user from this detail ?

 If you are just describing utopia, then of course I agree. It would be
 great to have a language which hid us from this. In the meantime the user
 has choices, and the best choice depends on the task and the real goal.

 Gabor Grothendieck ggrothendi...@gmail.com wrote in message
 news:971536df1001280428p345f8ff4v5f3a80c13f96d...@mail.gmail.com...
 It's only important internally. Externally it's undesirable that the
 user has to get involved in it. The idea of making software easy to
 write and use is to hide the implementation and focus on the problem.
 That is why we use high level languages, object orientation, etc.

 On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle mdo...@mdowle.plus.com
 wrote:
 How it represents data internally is very important, depending on the 
 real
 goal :
 http://en.wikipedia.org/wiki/Column-oriented_DBMS


 Gabor Grothendieck ggrothendi...@gmail.com wrote in message
 news:971536df1001271710o4ea62333l7f1230b860114...@mail.gmail.com...
 How it represents data internally should not be important as long as
 you can do what you want. SQL is declarative so you just specify what
 you want rather than how to get it and invisibly to the user it
 automatically draws

Re: [R] RMySQL - Bulk loading data and creating FK links

2010-01-27 Thread Matthew Dowle

 sqldf("select * from BOD order by Time desc limit 3")
Exactly. SQL requires use of "order by". It knows the order, but it isn't 
ordered. That's not good, but might be fine, depending on what the real goal 
is.


Gabor Grothendieck ggrothendi...@gmail.com wrote in message 
news:971536df1001270629w4795da89vb7d77af6e4e8b...@mail.gmail.com...
On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle mdo...@mdowle.plus.com 
wrote:
 How many columns, and of what type are the columns ? As Olga asked too, it
 would be useful to know more about what you're really trying to do.

 3.5m rows is not actually that many rows, even for 32bit R. It depends on
 the columns and what you want to do with those columns.

 At the risk of suggesting something before we know the full facts, one
 possibility is to load the data from flat file into data.table. Use 
 setkey()
 to set your keys. Use tables() to summarise your various tables. Then do
 your joins etc all-in-R. data.table has fast ways to do those sorts of
 joins (but we need more info about your task).

 Alternatively, you could check out the sqldf website. There is an
 sqlread.csv (or similar name) which can read your files directly into SQL

read.csv.sql

 instead of going via R. Gabor has some nice examples there about that and
 its faster.

 You use some buzzwords which makes me think that SQL may not be 
 appropriate
 for your task though. Can't say for sure (because we don't have enough
 information) but it's possible you are struggling because SQL has no row
 ordering concept built in. That might be why you've created an increment

In the SQLite database it automatically assigns a self incrementing
hidden column called rowid to each row. e.g. using SQLite via the
sqldf package on CRAN and the BOD data frame which is built into R we
can display the rowid column explicitly by referring to it in our
select statement:

 library(sqldf)
 BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
 sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8


 field? Do your queries include "order by <incrementing field>"? SQL is not
 good at "first" and "last" type logic. An all-in-R solution may well be

In SQLite you can get the top 3 values, say, like this (continuing the
prior example):

 sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

 better, since R is very good with ordered vectors. A 1GB data.table (or
 data.frame) for example, at 3.5m rows, could have 76 integer columns, or
 38 double columns. 1GB is well within 32bit and allows some space for
 working copies, depending on what you want to do with the data. If you 
 have
 38 or less columns, or you have 64bit, then an all-in-R solution *might*
 get your task done quicker, depending on what your real goal is.

 If this sounds plausible, you could post more details and, if its
 appropriate, and luck is on your side, someone might even sketch out how 
 to
 do an all-in-R solution.


 Nathan S. Watson-Haigh nathan.watson-ha...@csiro.au wrote in message
 news:4b5fde1b.10...@csiro.au...
I have a table (contact) with several fields and its PK is an auto
increment field. I'm bulk loading data to this table from files which, if
successful will be about 3.5million rows (approx 16000 rows per file).
However, I have a linking table (an_contact) to resolve a m:m relationship
between the an and contact tables. How can I retrieve the PK's for the 
data
bulk loaded into contact so I can insert the relevant data into 
an_contact.

 I currently load the data into contact using:
 dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)

 But I then need to get all the PK's which this dbWriteTable() appended to
 the contact table so I can load the data into my an_contact link table. I
 don't want to issue a separate INSERT query for each row in dat and then
 use MySQL's LAST_INSERT_ID() function... not when I have 3.5million rows 
 to
 insert!

 Any pointers welcome,
 Nathan

 --
 
 Dr. Nathan S. Watson-Haigh
 OCE Post Doctoral Fellow
 CSIRO Livestock Industries
 University Drive
 Townsville, QLD 4810
 Australia

 Tel: +61 (0)7 4753 8548
 Fax: +61 (0)7 4753 8600
 Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Once again: Error: cannot allocate vector of size

2010-01-22 Thread Matthew Dowle
Please re-read the posting guide, e.g. you didn't provide an example data set 
or a way to generate one, or any R version information.

Werner W. pensterfuz...@yahoo.de wrote in message 
news:646146.32238...@web23002.mail.ird.yahoo.com...
 Hi,

 I have browsed the help list and looked at the FAQ but I don't find 
 conclusive evidence whether this is normal or I am doing something wrong.
 I am running a lm() on a data.frame with 27136 observations of 6 variables 
 (3 num and 3 factor).
 After a while R throws this:

 lm(log(y) ~ log(a) + log(b) + c + d + e, data=reg.data , 
 na.action=na.exclude)
 Error: cannot allocate vector of size 203.7 MB

 This is a Windows XP 32 bit machine with 4 GB in it so that theoretically, 
 R should be able to claim close to 2 GB.
 This is the gc() after the regression:
             used (Mb) gc trigger  (Mb)  max used   (Mb)
 Ncells   272299  7.3     875833  23.4   1368491   36.6
 Vcells  4526037 34.6  116536251 889.2 145524997 1110.3

 memory.size(max=T)
 [1] 1230.25
 memory.size(max=F)
 [1] 47.89


 Looking at memory.size, R should be easily able to allocate that space, 
 shouldn't it?

 Many thanks for any hints!

 --Werner
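
A sketch of one way to generate data of the shape described (27136 obs of 
3 numeric and 3 factor variables; the factor level counts are assumptions). 
Note that 203.7 MB / 8 bytes is about 26.7 million doubles, i.e. a model 
matrix of roughly 27136 x 980, which suggests the factors have several 
hundred levels between them :

set.seed(1)
n <- 27136
reg.data <- data.frame(y = rlnorm(n), a = rlnorm(n), b = rlnorm(n),
                       c = factor(sample(500, n, TRUE)),
                       d = factor(sample(300, n, TRUE)),
                       e = factor(sample(200, n, TRUE)))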



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Merging and extracting data from list

2010-01-22 Thread Matthew Dowle

?merge
plyr
data.table
sqldf
crantastic
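
A minimal sketch of the splitting step asked about below (assuming the 
multi-feature strings are space-separated) :

feats <- strsplit(as.character(chersList$features), " ")
chersList$description <- sapply(feats, function(f)
  paste(hSgenes$description[match(f, hSgenes$name)], collapse = "; "))

And for the second question, selecting list elements containing a GO id 
(the toy list here is abbreviated from the data below) :

goList <- list(ENSG003 = c("GO:0043123", "GO:0004871"),
               ENSG457 = c("GO:0005737", "GO:0005515"))
names(goList)[sapply(goList, function(g) "GO:0005515" %in% g)]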

Dr. Viviana Menzel vivianamen...@gmx.de wrote in message 
news:4b58a0e9.3050...@gmx.de...
Hello R-help group,

I have a question about merging lists. I have two lists:

Genes list (hSgenes)
name  chr  strand  start  end  transStart  transEnd
symbol  description  feature
ENSG02239721111874144121187414412
DEAD/H box polypeptide 11 like 1DEAD/H box polypeptide 11 like 3DEAD/H
box polypeptide 11 like 9 ;; [Source:UniProtKB/TrEMBL;Acc:B7ZGX0]gene
ENSG02272321-114363295701755129343
WASH5PWAS protein family homolog 5 pseudogene (WASH5P), non-coding
RNA [Source:RefSeq DNA;Acc:NR_024540]gene
.

Chers list (chersList)
name  chr  start  end  cellType  antibody  features
maxLevel  score
chr1.cher11859132859732humanABENSG0223764
ENSG0231958 ENSG01876341.257360389683160.664381383074449
chr1.cher21889564890464humanABENSG0188976
1.478842336320642.88839131446868
chr1.cher3111063641106864humanAB
ENSG01625711.837956544181153.58404359147275


In the second list, I want to add a column with the gene description
(obtained from the first list). I used the following method:

chersMergeGenes <-
data.frame(chersList, description=hSgenes$description[match(chersList$features,
hSgenes$name)], symbol=hSgenes$symbol[match(chersList$features,
hSgenes$name)])
write.table(chersMergeGenes, row.names=F, quote=F, sep="\t",
file="chersMergeGenes.txt")


and it works only partially. When chersList$features contains more than
one feature (e.g. ENSG0223764 ENSG0231958 ENSG0187634), it
doesn't work (NA as result).
But I don't know how to split the features to obtain all descriptions.

Can someone give me a hint to do this?


Another problem:

I have following data:

$ENSG003
[1] "GO:0043123" "GO:0004871"

$ENSG419
 [1] "GO:0018406" "GO:0035269" "GO:0006506" "GO:0019348" "GO:0005789"
 [6] "GO:0005624" "GO:0005783" "GO:0033185" "GO:0004582" "GO:0004169"
[11] "GO:0005515"

$ENSG457
[1] "GO:0005737" "GO:0030027" "GO:0005794" "GO:0005515"

I want to extract a list of names ($ENSG0?) where go =
"GO:0005515". How can I do it?
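
A minimal sketch of that extraction, assuming the named list printed
above is called goList (the name is invented here):

 names(goList)[sapply(goList, function(g) "GO:0005515" %in% g)]

sapply() returns one TRUE/FALSE per list element, so the subset keeps
just the ENSG names whose GO vector contains that term.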

Thanks in advance

Viviana

-- 
~~~
Dr. Viviana Menzel
Rottweg 34
35428 Langgöns
Tel.: +49 6403 7748550
Mobil: +49 177 5126092
E-Mail: vivianamen...@gmx.de
Web: www.dres-menzel.de

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] loop on list levels and names

2010-01-22 Thread Matthew Dowle

Great.

If you mean the crantastic r package, sorry I wasn't clear,  I meant the 
crantastic website http://crantastic.org/.
If you meant the description of plyr then if the description looks useful 
then click the link taking you to the package documentation and read it. 
Same for any of the other packages.

The idea,  I think,  is that it's a good idea to make yourself aware of the 
most popular packages i.e. perhaps just read the descriptions of the top 30 
or something like that maybe.  Maybe it helps you avoid re-inventing the 
wheel.  That seems to be the case here.

Re Don's reply, sure you can use split().  But that will use more memory. 
And using paste for this?  Ok, it works, but don't you want to use better 
ways?  data.table should be much faster and more convenient, quicker to 
write than split and paste like that.
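
For comparison, the split()-and-paste route being discussed looks roughly
like this -- a sketch, assuming the ssfamed data from the original post:

 grp <- paste(ssfamed$SPECSHOR, ssfamed$BONE, sep="_")
 sapply(split(ssfamed$Asfc, grp), mean)

It works, but the pasted grouping vector and the split copies are the
extra memory referred to above.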

HTH


Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message 
news:4b59bdc5.60...@uni-hamburg.de...
I didn't know about crantastic actually.
I've looked at what it is exactly and it indeed looks interesting, but I
don't really see how I would know that it would help me for the task.
There's a description of what it was built for, but how can I then know
which function from this package can help me?

Thanks for your answer (you all), I'll work on it!
I'll keep you informed if it doesn't work (!), and I'll go vote on
crantastic when I'll have a bit more experience with the packages I use
(right now I'm just using the ones I was told for one specific
function), but don't worry, I won't forget. As you said, it only works if
users contribute to it. That's the power of R!

Ivan



On 1/21/2010 19:01, Matthew Dowle wrote:
 One way is :

 dataset = data.table(ssfamed)
 dataset[,  <whatever some functions are on Asfc, Smc, epLsar, etc>,
 by="SPECSHOR,BONE"]

 Your SPECSHOR and BONE names will be in your result alongside the results
 of the <whatever> ...

 Or try package plyr which does this sort of thing too.  And sqldf may be
 better if you know SQL and prefer it.  There are actually zillions of ways
 to do it : by(), doBy() etc etc

 If you get your code working the way it's constructed currently, it's going
 to be very slow, because of those "==" comparisons.  data.table doesn't do
 that and is pretty fast for this kind of thing. You might find that plyr is
 easier to use and more flexible though if speed isn't an issue, depending
 on exactly what you want to do.

 Whichever way you decide,  consider voting on crantastic for the package 
 you
 end up using,  and that may be a quick and easy way for you to help new R
 users in the future, and help us all by reducing the r-help traffic on the
 same subject over and over again.

 Note that plyr is the 2nd spot on crantastic,  it would have solved your
 problem without needing to write that code.  If you check crantastic first
 and make sure you're aware of popular packages, it might avoid getting 
 stuck
 in this way again.  It only works if users contribute to it though.


 Ivan Calandraivan.calan...@uni-hamburg.de  wrote in message
 news:4b587cdd.4070...@uni-hamburg.de...

 Hi everybody!

 To use some functions, I have to transform my dataset into a list, where
 each element contains one group, and I have to prepare a list for each
 variable I have (altogether I have 15 variables, and many entries per
 factor level)

 Here is some part of my dataset:
 SPECSHOR  BONE    Asfc        Smc        epLsar
 cotau     tx      454.390369  29.261638  0.001136
 cotau     tx      117.445711   4.291884  0.00056
 cotau     tx      381.024682  15.313017  0.002324
 cotau     tx      159.081789  18.134533  0.000462
 cotau     tm      160.641503   6.411332  0.000571
 cotau     tm       79.238023   3.828254  0.001182
 cotau     tm      143.206551   1.921899  0.000192
 cotau     tm      115.476996  33.116386  0.000417
 cotau     tm      594.256234  72.538131  0.000477
 eqgre     tx      188.261324   8.279096  0.000777
 eqgre     tx      152.444216   2.596325  0.001022
 eqgre     tx      256.601507   8.279096  0.000566
 eqgre     tx      250.816445  18.134533  0.000535
 eqgre     tx      272.396711  24.492879  0.000585
 eqgre     tm      172.63264    4.291884  0.001781
 eqgre     tm      189.441097  14.425498  0.001347
 eqgre     tm      170.743788  13.564472  0.000602
 eqgre     tm      158.960849  10.385299  0.001189
 eqgre     tm       80.972408   3.828254  0.000644
 gicam     tx      294.494001   9.656738  0.000524
 gicam     tx      267.126765  19.128024  0.000647
 gicam     tx       81.888658   4.782006  0.000492
 gicam     tx      168.329081   2.729939  0.001097
 gicam     tx      123.296056   7.007427  0.000659
 gicam     tm       94.264887  18.134533  0.000752
 gicam     tm       54.317395   3.828254  0.00038
 gicam     tm       55.978883  17.167534  0.000141
 gicam     tm      279.597993  15.313017  0.000398
 gicam     tm      288.262556  18.134533  0.001043

 What I do next is:
 
 list_Asfc <- list()
 list_Asfc[[1]] <- ssfamed[ssfamed$SPECSHOR

Re: [R] Once again: Error: cannot allocate vector of size

2010-01-22 Thread Matthew Dowle
Fantastic. You're much more likely to get a response now.  Best of luck.

werner w pensterfuz...@yahoo.de wrote in message 
news:1264175935970-1100164.p...@n4.nabble.com...

 Thanks Matthew, you are absolutely right.

 I am working on Windows XP SP2 32bit with R versions 2.9.1.

 Here is an example:
  d <- as.data.frame(matrix(trunc(rnorm(6*27136, 1, 100)), ncol=6))
  d[,4:5] <- trunc(100*runif(2*27136, 0, 1))
  d[,6] <- trunc(1000*runif(27136, 0, 1))
  for (i in 4:6) d[,i] <- as.factor(d[,i])
  lm(V1 ~ log(V2) + log(V3) + V4 + V5 + V6, data=d)
  memory.size(max=F)
  memory.size(max=T)

 I managed to get it to run through after setting the 3GB switch for Windows 
 and with a clean R session.
 I also noticed later that, after removing na.action=na.exclude, more
 regressions run through.

 But before and after the lm() it seems there should be enough memory which
 means that lm() builds up some quite large objects during its 
 computations?
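
 A rough back-of-envelope check of that suspicion (a sketch, assuming the
 factors generated above end up with about 100, 100 and 1000 levels):
 lm() expands each factor into dummy columns of a dense model matrix, so
 one copy alone is around

  27136 * (1 + 2 + 99 + 99 + 999) * 8 / 2^20   # rows * cols * 8 bytes ~ 248 MB

 and the QR decomposition inside lm.fit() is the same size again, so a
 failed allocation of ~200 MB part-way through is quite plausible even
 when memory.size() looks comfortable before and after.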
 -- 
 View this message in context: 
 http://n4.nabble.com/Once-again-Error-cannot-allocate-vector-of-size-tp1083506p1100164.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] loop on list levels and names

2010-01-22 Thread Matthew Dowle
data.table is the package name too. Make sure you find ?"[.data.table", which 
is linked from ?data.table.
You could just do a mean of one variable first, and then build it up from 
there, e.g.  dataset[, mean(epLsar), by="SPECSHOR,BONE"].
To get multiple columns of output, wrap with DT() like this:  dataset[, 
DT(mean(epLsar), min(epLsar)), by="SPECSHOR,BONE"]
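
(For readers on a later data.table: DT() is the v1.2-era spelling; if I
recall the newer syntax correctly, the same call is written today with
list(), e.g.

 dataset[, list(mean=mean(epLsar), min=min(epLsar)), by=list(SPECSHOR, BONE)]

which also names the output columns.)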
Btw, v1.3 on r-forge fixes a version check warning with v1.2 on R 2.10+ (not 
fixed by me but thanks to a contributor), so if you can't live with the 
warning messages, you can install v1.3 from r-forge like this :
install.packages("data.table", repos="http://r-forge.r-project.org")

Best of luck.

Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message 
news:4b59d93c.5080...@uni-hamburg.de...
Thanks for your advice, I will work on it then!
Just one last question. In which package can I find the function
data.table?
Ivan

On 1/22/2010 17:18, Matthew Dowle wrote:
 Great.

 If you mean the crantastic r package, sorry I wasn't clear,  I meant the
 crantastic website http://crantastic.org/.
 If you meant the description of plyr then if the description looks useful
 then click the link taking you to the package documentation and read it.
 Same for any of the other packages.

 The idea,  I think,  is that it's a good idea to make yourself aware of the
 most popular packages i.e. perhaps just read the descriptions of the top 
 30
 or something like that maybe.  Maybe it helps you avoid re-inventing the
 wheel.  That seems to be the case here.

 Re Don's reply, sure you can use split().  But that will use more memory.
 And using paste for this?  Ok, it works, but don't you want to use better
 ways?  data.table should be much faster and more convenient, quicker to
 write than split and paste like that.

 HTH


 Ivan Calandraivan.calan...@uni-hamburg.de  wrote in message
 news:4b59bdc5.60...@uni-hamburg.de...
 I didn't know about crantastic actually.
 I've looked at what it is exactly and it indeed looks interesting, but I
 don't really see how I would know that it would help me for the task.
 There's a description of what it was built for, but how can I then know
 which function from this package can help me?

 Thanks for your answer (you all), I'll work on it!
 I'll keep you informed if it doesn't work (!), and I'll go vote on
 crantastic when I'll have a bit more experience with the packages I use
 (right now I'm just using the ones I was told for one specific
 function), but don't worry, I won't forget. As you said, it only works if
 users contribute to it. That's the power of R!

 Ivan



 On 1/21/2010 19:01, Matthew Dowle wrote:

 One way is :

 dataset = data.table(ssfamed)
  dataset[,   <whatever some functions are on Asfc, Smc, epLsar, etc>,
  by="SPECSHOR,BONE"]

  Your SPECSHOR and BONE names will be in your result alongside the results
  of the <whatever> ...

 Or try package plyr which does this sort of thing too.  And sqldf may be
 better if you know SQL and prefer it.  There are actually zillions of 
 ways
 to do it : by(), doBy() etc etc

  If you get your code working the way it's constructed currently, it's going
  to be very slow, because of those "==" comparisons.  data.table doesn't do
  that and is pretty fast for this kind of thing. You might find that plyr is
  easier to use and more flexible though if speed isn't an issue, depending
  on exactly what you want to do.

 Whichever way you decide,  consider voting on crantastic for the package
 you
 end up using,  and that may be a quick and easy way for you to help new R
 users in the future, and help us all by reducing the r-help traffic on 
 the
 same subject over and over again.

 Note that plyr is the 2nd spot on crantastic,  it would have solved your
 problem without needing to write that code.  If you check crantastic 
 first
 and make sure you're aware of popular packages, it might avoid getting
 stuck
 in this way again.  It only works if users contribute to it though.


 Ivan Calandraivan.calan...@uni-hamburg.de   wrote in message
 news:4b587cdd.4070...@uni-hamburg.de...


 Hi everybody!

 To use some functions, I have to transform my dataset into a list, where
 each element contains one group, and I have to prepare a list for each
 variable I have (altogether I have 15 variables, and many entries per
 factor level)

 Here is some part of my dataset:
 SPECSHOR  BONE    Asfc        Smc        epLsar
 cotau     tx      454.390369  29.261638  0.001136
 cotau     tx      117.445711   4.291884  0.00056
 cotau     tx      381.024682  15.313017  0.002324
 cotau     tx      159.081789  18.134533  0.000462
 cotau     tm      160.641503   6.411332  0.000571
 cotau     tm       79.238023   3.828254  0.001182
 cotau     tm      143.206551   1.921899  0.000192
 cotau     tm      115.476996  33.116386  0.000417
 cotau     tm      594.256234  72.538131  0.000477
 eqgre     tx      188.261324   8.279096  0.000777
 eqgre     tx      152.444216   2.596325  0.001022
 eqgre     tx      256.601507   8.279096  0.000566
 eqgre

Re: [R] loop on list levels and names

2010-01-21 Thread Matthew Dowle
One way is :

dataset = data.table(ssfamed)
dataset[, <whatever some functions are on Asfc, Smc, epLsar, etc>, 
by="SPECSHOR,BONE"]

Your SPECSHOR and BONE names will be in your result alongside the results of 
the <whatever> ...

Or try package plyr which does this sort of thing too.  And sqldf may be 
better if you know SQL and prefer it.  There are actually zillions of ways 
to do it : by(), doBy() etc etc
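
For instance, a base-R equivalent of the grouped summary (a sketch,
assuming the ssfamed data quoted below and a reasonably recent R with
the formula method for aggregate):

 aggregate(cbind(Asfc, Smc, epLsar) ~ SPECSHOR + BONE, data=ssfamed, FUN=mean)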

If you get your code working the way it's constructed currently, it's going 
to be very slow, because of those "==" comparisons.  data.table doesn't do 
that and is pretty fast for this kind of thing. You might find that plyr is 
easier to use and more flexible though if speed isn't an issue, depending on 
exactly what you want to do.

Whichever way you decide,  consider voting on crantastic for the package you 
end up using,  and that may be a quick and easy way for you to help new R 
users in the future, and help us all by reducing the r-help traffic on the 
same subject over and over again.

Note that plyr is the 2nd spot on crantastic,  it would have solved your 
problem without needing to write that code.  If you check crantastic first 
and make sure you're aware of popular packages, it might avoid getting stuck 
in this way again.  It only works if users contribute to it though.


Ivan Calandra ivan.calan...@uni-hamburg.de wrote in message 
news:4b587cdd.4070...@uni-hamburg.de...
 Hi everybody!

 To use some functions, I have to transform my dataset into a list, where
 each element contains one group, and I have to prepare a list for each
 variable I have (altogether I have 15 variables, and many entries per
 factor level)

 Here is some part of my dataset:
 SPECSHOR  BONE    Asfc        Smc        epLsar
 cotau     tx      454.390369  29.261638  0.001136
 cotau     tx      117.445711   4.291884  0.00056
 cotau     tx      381.024682  15.313017  0.002324
 cotau     tx      159.081789  18.134533  0.000462
 cotau     tm      160.641503   6.411332  0.000571
 cotau     tm       79.238023   3.828254  0.001182
 cotau     tm      143.206551   1.921899  0.000192
 cotau     tm      115.476996  33.116386  0.000417
 cotau     tm      594.256234  72.538131  0.000477
 eqgre     tx      188.261324   8.279096  0.000777
 eqgre     tx      152.444216   2.596325  0.001022
 eqgre     tx      256.601507   8.279096  0.000566
 eqgre     tx      250.816445  18.134533  0.000535
 eqgre     tx      272.396711  24.492879  0.000585
 eqgre     tm      172.63264    4.291884  0.001781
 eqgre     tm      189.441097  14.425498  0.001347
 eqgre     tm      170.743788  13.564472  0.000602
 eqgre     tm      158.960849  10.385299  0.001189
 eqgre     tm       80.972408   3.828254  0.000644
 gicam     tx      294.494001   9.656738  0.000524
 gicam     tx      267.126765  19.128024  0.000647
 gicam     tx       81.888658   4.782006  0.000492
 gicam     tx      168.329081   2.729939  0.001097
 gicam     tx      123.296056   7.007427  0.000659
 gicam     tm       94.264887  18.134533  0.000752
 gicam     tm       54.317395   3.828254  0.00038
 gicam     tm       55.978883  17.167534  0.000141
 gicam     tm      279.597993  15.313017  0.000398
 gicam     tm      288.262556  18.134533  0.001043

 What I do next is:
 
 list_Asfc <- list()
 list_Asfc[[1]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tx', 3]
 list_Asfc[[2]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tm', 3]
 

 And so on for each level of SPECSHOR and BONE

 I'm stuck on 2 parts:
 - in a loop or something similar, I would like the 1st element of the
 list to be filled by the values for the 1st variable with the first
 level of my factors (i.e. cotau + tx), and then the 2nd element with the
 2nd level (i.e. cotau + tm) and so on. As shown above, I know how to do
 it if I enter manually the different levels, but I have no idea which
 function I should use so that each combination of factor will be used.
 See what I mean?

 - I would then like to run it in a loop or something for each variable.
 It is by itself not so complicated, but I don't know how to give the
 correct name to my list. I want the list containing the data for Asfc to
 be named list_Asfc.
 Here is what I tried:
 
 seq.num <- c(seq(3,5,1))  # the indexes of the variables
 for(i in 1:length(seq.num)) {
  k <- seq.num[i]
  name.num <- names(ssfamed)[k]
  list <- list()
  list[[1]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tx', i]
  list[[2]] <- ssfamed[ssfamed$SPECSHOR=='cotau' & ssfamed$BONE=='tm', i]
  names(list) <- c("cotau_tx", "cotau_tm")  # I have more and the 1st
                                            # question should help me on that too
 }
 
 After names(list) I need to insert something like: name_list <- list
 But I don't know how to give it the correct name. How do we change the
 name of an object? Or am I on the wrong path?
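
 Both parts can be sketched in a few lines, assuming the ssfamed data
 above: unique() enumerates every (SPECSHOR, BONE) combination that
 actually occurs, and assign() binds a value to a name built at run time,
 which answers the naming question. (Note the draft above indexes column
 i, the loop counter, where the column number k was presumably meant.)

  combs <- unique(ssfamed[, c("SPECSHOR", "BONE")])
  for (k in seq.num) {
    res <- lapply(seq_len(nrow(combs)), function(i)
      ssfamed[ssfamed$SPECSHOR == combs$SPECSHOR[i] &
              ssfamed$BONE == combs$BONE[i], k])
    names(res) <- paste(combs$SPECSHOR, combs$BONE, sep="_")
    assign(paste("list", names(ssfamed)[k], sep="_"), res)
  }

 After the loop there is one list per variable, named list_Asfc, list_Smc
 and list_epLsar, with one element per factor combination.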

 Thank you in advance for your help.
 Ivan

 PS: if necessary: under Windows XP, R2.10.


 [[alternative HTML version deleted]]


__

Re: [R] Multiple sets of data in one dataset....Need a loop?

2010-01-21 Thread Matthew Dowle
 but I have thousands of results so it would be really handy to find a way of 
 doing this quickly
 it's a little difficult to follow those examples

Given your data in data.frame DF, maybe add the following to your list to 
investigate :

 dat = data.table(DF)
 dat[, cor(Score1,Score2), by="Experiment"]
      Experiment         V1
 [1,]          X  0.9889524
 [2,]          Y  0.3041195
 [3,]          Z -0.1346107

To do a plot instead just replace "cor" with "plot" or whatever else you 
want to do within each group.
Since you said you have thousands of results,  data.table is faster for 
that.
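
A sketch of the plotting variant in base R, assuming the raw data.frame
DF with columns Experiment, Score1 and Score2 (one plot per group, so
thousands of groups would want a multi-page device such as pdf() first):

 for (e in unique(DF$Experiment)) {
   with(DF[DF$Experiment == e, ], plot(Score1, Score2, main=e))
 }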

In terms of ease of use,  you could try plyr too,  which you may well 
prefer.

 those examples as they all seem so different

If you look and search crantastic, users are putting their comments there. 
That might help you make a decision more quickly and avoid you needing to 
post to r-help and wait for a reply,  assuming there is a package that 
already does what you need. Searching the history of r-help would have found 
many solutions to your problem this time, but it seems you are looking for 
advice on the best way. This changes over time and depends on lots of 
factors, including what you really want to do. Once you have worked out 
which packages work best for you, put your votes/comments onto crantastic 
and it should help everyone who follows in your path.  I guess you should 
then update your votes/comments as time progresses too.

Btw, plyr is ranked #2 on crantastic and is designed specifically for your 
task!  Making yourself aware of the most popular packages would have 
helped you.  If you need speed try data.table.  When it comes to current, 
up to date advice on the most appropriate package, crantastic could be 
fantastic, assuming of course that you, the user, contribute to it.

HTH

BioStudent s0975...@sms.ed.ac.uk wrote in message 
news:1264072645590-1049653.p...@n4.nabble.com...

 Hi, thanks for all your help.

 It's a little difficult to follow those examples, as they all seem so 
 different, and it's hard to see from the help files how to do what I want 
 with my data, but I'll try...
 -- 
 View this message in context: 
 http://n4.nabble.com/Mutliple-sets-of-data-in-one-dataset-Need-a-loop-tp1018503p1049653.html
 Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] problem of data manipulation

2010-01-20 Thread Matthew Dowle

The user wrote in their first post :
 I have a lot of observations in my dataset

Here's one way to do it with a data.table :
 a = data.table(a)
 ans = a[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"]
 class(ans$dt) = "Date"

Timings are below comparing the 3 methods. In this example, data.table 
appears to be 28 times faster than plyr, and 24 times faster than sqldf.  I 
excluded the one-off time to build the key, since that's realistic, but even 
including that time, data.table is still 16 times faster than plyr (134 / 
(1.03+2.16+4.71)).  With even more rows, the speedups should be even bigger.

 a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = 
 c("c", "n", "s"), class = "factor"), var2 = c(1L, 1L, 1L, 2L, 2L, 2L), 
 var3 = c(2L, 2L, 2L, 1L, 1L, 1L), dt = structure(c(10592, 10997, 11000, 
 10998, 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3", 
 "dt"), row.names = c(NA, -6L), class = "data.frame")

 a = data.frame(lapply(a,function(x)rep(x,each=100)))
 dim(a)
[1] 600   4
 library(plyr)
 system.time({ans1 <- ddply(a, c("var1", "var2", "var3"), subset, 
 dt - min(dt) < 7)})
    user  system elapsed 
  131.39    3.11  134.80 
 library(sqldf)
 system.time({ans2 <- sqldf("select var1, var2, var3, dt from a, (select 
 var1, var2, var3, min(dt) mindt from a group by var1, var2, var3) 
 using(var1, var2, var3) where dt - mindt < 7")})
    user  system elapsed 
  110.26    2.24  113.32 
 mapply(identical,ans1,ans2[order(ans2$var1),])
var1 var2 var3   dt
TRUE TRUE TRUE TRUE

 library(data.table)
 system.time({adt <- data.table(a)})
    user  system elapsed 
    0.90    0.13    1.03 
 system.time({setkey(adt,var1,var2,var3)})
    user  system elapsed 
    1.89    0.27    2.16 
 system.time({ans3 <- 
 adt[,list(dt=dt[dt-min(dt)<7]),by="var1,var2,var3"]})
    user  system elapsed 
    3.92    0.78    4.71 
 class(ans3$dt) = "Date"
 mapply(identical,ans1,ans3)
var1 var2 var3   dt
TRUE TRUE TRUE TRUE

Note that in the documentation ?"[.data.table" where I say that 'by' is slow, 
I mean relative to how fast it could be.  It seems, in this specific 
example anyway, and with the code posted so far, to be significantly faster 
than sqldf and plyr.


Gabor Grothendieck ggrothendi...@gmail.com wrote in message 
news:971536df1001191350x3bd5d982j9879e05453760...@mail.gmail.com...
 Using data frame, a, from the post below this is how it would be done
 in SQL using sqldf.  We join together the original table, a,  with a
 table of minimums (computed by the nested select) and then choose only
 the rows where dt - mindt < 7 (in the where clause).

 library(sqldf)
 sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3, 
 min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3) 
 where dt - mindt < 7")
   var1 var2 var3         dt
 1    s    1    2 1999-01-01
 2    c    1    2 2000-02-10
 3    c    1    2 2000-02-13
 4    n    2    1 2000-02-11
 5    n    2    1 2000-02-15


 On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham h.wick...@gmail.com 
 wrote:
 On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter gunter.ber...@gene.com 
 wrote:
 One way to do it:

 1. Convert your date column to the Date class using the as.Date() 
 function.
 This allows you to do the necessary arithmetic on the dates below.
 dt <- as.Date(a[,4], "%d/%m/%Y")

 2. Create a factor out of your first three columns whose levels are in 
 the
 same order as the unique rows. Something like the following should do 
 it:
 fac <- do.call(paste, a[,-4])
 fac <- factor(fac, levels=unique(fac))

 This allows you to choose the groups of rows whose dates you wish to 
 compare
 and maintain their correct order in the data frame

 3. Then use tapply:
 a[unlist(tapply(dt, fac, function(x) x - min(x) < 7)), ]

 (unlist is needed to remove the list structure and concatenate the 
 logical
 indices to obtain the subscripting vector).

 Here's the same basic approach with the plyr package:

 a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = 
 c("c", "n", "s"), class = "factor"), var2 = c(1, 1, 1, 2, 2, 2), 
 var3 = c(2, 2, 2, 1, 1, 1), dt = structure(c(10592, 10997, 11000, 10998, 
 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3", 
 "dt"), row.names = c(NA, -6L), class = "data.frame")

 library(plyr)
 ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)

 Hadley


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] problem of data manipulation

2010-01-20 Thread Matthew Dowle
Sounds like a good idea. Would it be possible to give an example of how to 
combine plyr with data.table, and why that is better than a data.table only 
solution ?

hadley wickham h.wick...@gmail.com wrote in message 
news:f8e6ff051001200624r2175e38xf558dc8fa3fb6...@mail.gmail.com...
 Note that in the documentation ?"[.data.table" where I say that 'by' is 
 slow, I mean relative to how fast it could be. It seems, in this specific 
 example anyway, and with the code posted so far, to be significantly 
 faster than sqldf and plyr.

Of course the best of both worlds would be to use data table within
plyr to get both speed and a consistent syntax for other types of
split-apply-combine tasks.

Hadley


-- 
http://had.co.nz/

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

