Marc,

The data structure is a list of data frames generated from read.table:

> class(nmr.spectra.serum)
[1] "list"
> class(nmr.spectra.serum[[1]])
[1] "data.frame"
> dim(nmr.spectra.serum[[1]])
[1] 32768     2

Converting the data.frames to matrices does not have much of an effect on timing.

-Christos

-----Original Message-----
From: Marc Schwartz [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 01, 2007 11:06 PM
To: [EMAIL PROTECTED]
Cc: 'Prof Brian Ripley'; [email protected]
Subject: Re: [R] Lining up x-y datasets based on values of x

On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> Marc,
>
> I don't think the issue is duplicates in the matching columns.  The
> data were generated by an instrument (NMR spectrometer), processed by
> the instrument's software through an FFT transform and other
> transformations and finally reported as a sequence of chemical shift
> (x) vs intensity (y) pairs.
> So all x values are unique.  For the example that I reported earlier:
>
> > length(nmr.spectra.serum[[1]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[1]]$V1))
> [1] 32768
> > length(nmr.spectra.serum[[2]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[2]]$V1))
> [1] 32768
>
> And most of the x-values are common
>
> > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> [1] 32625
>
> For this reason, merge is probably an overkill for this problem and my
> initial thought was to align the datasets through some simple
> index-shifting operation.
>
> Profiling of the merge code in my case shows that most of the time is
> spent on data frame subsetting operations and on internal merge and
> rbind calls secondarily (if I read the summary output correctly).  So
> even if most of the time in the internal merge function is spent on
> sorting (haven't checked the source code), this is in the worst case a
> rather minor effect, as suggested by Prof. Ripley.
>
> > Rprof("merge.out")
> > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
> all=T, sort=T)
> > Rprof(NULL)
> > summaryRprof("merge.out")
>
> $by.self
>                        self.time self.pct total.time total.pct
> merge.data.frame            6.56     50.0      11.84      90.2
> [.data.frame                2.42     18.4       3.68      28.0
> merge                       1.28      9.8      13.12     100.0
> rbind                       1.24      9.5       1.36      10.4
> names<-.default             1.16      8.8       1.16       8.8
> row.names<-.data.frame      0.12      0.9       0.18       1.4
> duplicated.default          0.12      0.9       0.12       0.9
> make.unique                 0.10      0.8       0.10       0.8
> data.frame                  0.02      0.2       0.04       0.3
> *                           0.02      0.2       0.02       0.2
> is.na                       0.02      0.2       0.02       0.2
> match                       0.02      0.2       0.02       0.2
> order                       0.02      0.2       0.02       0.2
> unclass                     0.02      0.2       0.02       0.2
> [                           0.00      0.0       3.68      28.0
> do.call                     0.00      0.0       1.18       9.0
> names<-                     0.00      0.0       1.16       8.8
> row.names<-                 0.00      0.0       0.18       1.4
> any                         0.00      0.0       0.14       1.1
> duplicated                  0.00      0.0       0.12       0.9
> cbind                       0.00      0.0       0.04       0.3
> as.vector                   0.00      0.0       0.02       0.2
> seq                         0.00      0.0       0.02       0.2
> seq.default                 0.00      0.0       0.02       0.2
>
> $by.total
>                        total.time total.pct self.time self.pct
> merge                       13.12     100.0      1.28      9.8
> merge.data.frame            11.84      90.2      6.56     50.0
> [.data.frame                 3.68      28.0      2.42     18.4
> [                            3.68      28.0      0.00      0.0
> rbind                        1.36      10.4      1.24      9.5
> do.call                      1.18       9.0      0.00      0.0
> names<-.default              1.16       8.8      1.16      8.8
> names<-                      1.16       8.8      0.00      0.0
> row.names<-.data.frame       0.18       1.4      0.12      0.9
> row.names<-                  0.18       1.4      0.00      0.0
> any                          0.14       1.1      0.00      0.0
> duplicated.default           0.12       0.9      0.12      0.9
> duplicated                   0.12       0.9      0.00      0.0
> make.unique                  0.10       0.8      0.10      0.8
> data.frame                   0.04       0.3      0.02      0.2
> cbind                        0.04       0.3      0.00      0.0
> *                            0.02       0.2      0.02      0.2
> is.na                        0.02       0.2      0.02      0.2
> match                        0.02       0.2      0.02      0.2
> order                        0.02       0.2      0.02      0.2
> unclass                      0.02       0.2      0.02      0.2
> as.vector                    0.02       0.2      0.00      0.0
> seq                          0.02       0.2      0.00      0.0
> seq.default                  0.02       0.2      0.00      0.0
>
> $sampling.time
> [1] 13.12
>
>
> Thanks again for your time in looking into this.
> -Christos

Christos,

Thanks for the follow up.  Thought I had something, but apparently not.

Question: What is the actual structure of the nmr.spectra.serum objects?
The indexing approach that you have suggests they are not simple two
column objects, which may be at least partially the source of the
[.data.frame overhead.

Thanks,

Marc

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
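
A note on the index-based alignment idea raised in the thread: since the x values are reported as unique within each spectrum, two spectra could in principle be lined up with match() against the union of their x values rather than with a full merge(). The following is only a minimal sketch under that assumption. The helper name align_xy is hypothetical, the column names V1 and V2 are assumed to be the read.table defaults, and the toy data are made up. No timing claim is intended; it would have to be profiled against merge() on the real data.

```r
# Hypothetical helper: align two x-y data frames on the union of their x values.
# Assumes columns named V1 (x) and V2 (y), and unique x values within each frame.
align_xy <- function(a, b) {
  x <- sort(union(a$V1, b$V1))        # every x value seen in either data frame
  data.frame(
    V1   = x,
    V2.x = a$V2[match(x, a$V1)],      # NA where x does not occur in a
    V2.y = b$V2[match(x, b$V1)]       # NA where x does not occur in b
  )
}

# Made-up toy data: the two frames share x = 2, 3, 4
a <- data.frame(V1 = c(1, 2, 3, 4), V2 = c(10, 20, 30, 40))
b <- data.frame(V1 = c(2, 3, 4, 5), V2 = c(200, 300, 400, 500))

aligned <- align_xy(a, b)
print(aligned)
```

The output columns V1, V2.x and V2.y follow the names that merge(a, b, by="V1", all=TRUE) would give. Note that match() compares the x values exactly, so values that differ only by rounding will not be lined up.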
