That particular machine has been up for 40 days, although all the parameters are in the green right now.
Doing an rbind is as quick as what you reported:

> system.time(df2 <- rbind(df, df))
   user  system elapsed
  0.104   0.000   0.104

And now I do get the GAlignments warning:

> GappedAlignments()
GAlignments with 0 alignments and 0 metadata columns:
   seqnames strand       cigar    qwidth     start       end     width
      <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
       ngap
  <integer>
  ---
  seqlengths:
Warning message:
The GappedAlignments class, the GappedAlignments() constructor, and the
readGappedAlignments() function, have been renamed: GAlignments, GAlignments(),
and readGAlignments(), respectively. The old names are deprecated. Please use
the new names instead.

And the appending works as for you:

> library(Rsamtools)
> library(RNAseqData.HNRNPC.bam.chr14)
> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
> yieldSize(bamfile) <- 100000L
> open(bamfile)
> out <- GappedAlignments()
Warning message:
The GappedAlignments class, the GappedAlignments() constructor, and the
readGappedAlignments() function, have been renamed: GAlignments, GAlignments(),
and readGAlignments(), respectively. The old names are deprecated. Please use
the new names instead.
> chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
> system.time(out <- append(out, chunk))
   user  system elapsed
  0.092   0.000   0.091
> chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
> system.time(out <- append(out, chunk))
   user  system elapsed
  0.372   0.000   0.369
> chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
> system.time(out <- append(out, chunk))
   user  system elapsed
  0.896   0.012   0.909

And the sessionInfo is as before:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicRanges_1.13.26 Biostrings_2.29.12    XVector_0.1.0
[4] IRanges_1.19.15       BiocGenerics_0.7.2

loaded via a namespace (and not attached):
[1] stats4_3.0.1

So I'm not sure what happened; so far, I can only imagine an NFS- or
RAID-related issue. Doing it with my own data gives the same results as above.
Sorry for bothering you with that, and many thanks for the help.

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Jul 11, 2013, at 10:15 AM, Nicolas Delhomme wrote:

> Hej Hervé!
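For reference, the deprecation warnings quoted above already name the replacement API. A minimal sketch of the same chunked check written with the new names; this assumes a Bioc-devel session of that vintage, where readGAlignmentsFromBam() is exported by Rsamtools and GAlignments() replaces the GappedAlignments() constructor:

    library(Rsamtools)
    library(RNAseqData.HNRNPC.bam.chr14)

    bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
    yieldSize(bamfile) <- 100000L
    open(bamfile)

    out <- GAlignments()                      # new name for GappedAlignments()
    chunk <- readGAlignmentsFromBam(bamfile)  # new name for readBamGappedAlignments()
    system.time(out <- append(out, chunk))

    close(bamfile)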
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delho...@embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
> On Jul 10, 2013, at 8:54 PM, Hervé Pagès wrote:
>
>> Hi Nico,
>>
>> On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
>>> Hej Bioc Core!
>>>
>>> There was some discussion last year about implementing a BamStreamer (à la
>>> FastqStreamer), but I haven't seen anything like it in the current devel.
>>> I've implemented the following function that should do the job for me - I
>>> have many very large files, and I need to use a cluster with relatively
>>> little RAM per node and a restrictive time allocation, so I want to
>>> parallelize the reading of the BAM files to manage both. The example below
>>> obviously does not address the RAM issue, but I streamlined it to point
>>> out my problem.
>>>
>>> ".stream" <- function(bamFile, yieldSize = 100000, verbose = FALSE){
>>>
>>>     ## create a stream
>>>     stopifnot(is(bamFile, "BamFile"))
>>>
>>>     ## set the yieldSize if it is not set already
>>>     if(is.na(yieldSize(bamFile))){
>>>         yieldSize(bamFile) <- yieldSize
>>>     }
>>>
>>>     ## open it
>>>     open(bamFile)
>>>
>>>     ## verb
>>>     if(verbose){
>>>         message(paste("Streaming", basename(path(bamFile))))
>>>     }
>>>
>>>     ## create the output
>>>     out <- GappedAlignments()
>>>
>>>     ## process it
>>>     while(length(chunk <- readBamGappedAlignments(bamFile))){
>>>         if(verbose){
>>>             message(paste("Processed", length(chunk), "reads"))
>>>         }
>>>         out <- c(out, chunk)
>>>     }
>>
>> Note that regardless of the speed of c() on GappedAlignments objects,
>> growing an object in a loop is fundamentally inefficient (see Circle 2
>> of The R Inferno).
>> Also, keeping the chunks in memory kind of defeats the purpose of reading
>> the file one chunk at a time.
>
> Sure. What this function normally really does is a data reduction - basically
> getting a named vector back. I just came across the appending issue when
> preparing the code example above.
>
>>
>>>
>>>     ## close
>>>     close(bamFile)
>>>
>>>     ## return
>>>     return(out)
>>> }
>>>
>>> In the method above, the first iteration of combining the GappedAlignments:
>>>
>>>     out <- c(out, chunk)
>>>
>>> takes:
>>>
>>>     system.time(append(out, chunk))
>>>        user  system elapsed
>>>     123.704   0.060 124.011
>>
>> 2 minutes! Whaoo, that's really slow. I can't reproduce this on my
>> machine though:
>>
>
> OK, sounds more like a system issue then.
>
>> library(Rsamtools)
>> library(RNAseqData.HNRNPC.bam.chr14)
>> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
>> yieldSize(bamfile) <- 100000L
>> open(bamfile)
>> out <- GappedAlignments()
>>
>> Then:
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.284   0.000   0.286
>>
>> I wonder what's going on on your system. Are you sure it was not running
>> out of memory when you did this?
>
> Yes, that's a fat node with 0.2 TB of RAM and I was the only one on it at the
> time.
>
>> Try to check the load with uptime or
>> top in another terminal (e.g. start top right before you call append()).
>> If the system starts swapping, then your R process will become hundreds
>> or thousands of times slower!
>
> and there was no memory-intensive job running. It could still have been some
> NFS-related issue. I will retry with a fresh session and monitor the I/O as
> well.
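As a sketch of the reduce-as-you-go pattern described in the exchange above (the real function does a data reduction and returns a named vector), here is a hypothetical .streamCount() that collapses each chunk to per-chromosome read counts before the next chunk is read, so memory stays bounded by the yieldSize. The function name and the choice of reduction are made up for illustration; it keeps the then-current readBamGappedAlignments() to match the code above, and assumes table() on the seqnames Rle, which IRanges provides:

    ".streamCount" <- function(bamFile, yieldSize = 100000, verbose = FALSE){

        stopifnot(is(bamFile, "BamFile"))

        ## set the yieldSize if it is not set already
        if(is.na(yieldSize(bamFile))){
            yieldSize(bamFile) <- yieldSize
        }

        ## open it, and make sure it gets closed again
        open(bamFile)
        on.exit(close(bamFile))

        ## one counter per reference sequence, taken from the BAM header
        sn <- seqnames(seqinfo(bamFile))
        counts <- setNames(integer(length(sn)), sn)

        ## reduce every chunk to a small named vector, then let the chunk go
        while(length(chunk <- readBamGappedAlignments(bamFile))){
            if(verbose){
                message(paste("Processed", length(chunk), "reads"))
            }
            tab <- table(seqnames(chunk))
            counts[names(tab)] <- counts[names(tab)] + as.integer(tab)
        }

        ## the reduced result: a named vector, not the alignments themselves
        counts
    }

With this shape, the peak memory use is roughly one chunk of yieldSize alignments, whatever reduction is plugged in.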
>
>>
>>>
>>> whereas the second iteration (faked here) takes only (still long):
>>>
>>>     system.time(append(chunk, chunk))
>>>        user  system elapsed
>>>       2.708   0.044   2.758
>>
>> 2nd, 3rd and 4th iterations for me:
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.516   0.004   0.521
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.656   0.008   0.663
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.796   0.004   0.801
>>
>> As expected, the time is growing (this is why the process
>> of growing an object in a loop is considered to be quadratic
>> in time).
>
> Quadratic! Wow, I knew it was slower, but still... Good to know.
>
>>
>>>
>>> I suppose this has to do with the way
>>> GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the
>>> objects and all the related sanity checks. For the first iteration, the
>>> seqlengths are different, so I suppose that is what explains the 60x lag
>>> compared to the second iteration.
>>
>> The seqinfo of the 2 objects to combine needs to be merged
>> and set back on each object before the 2 objects can actually
>> be combined. This operation is cheap and I wouldn't expect it
>> to slow down the first iteration significantly.
>
> Yes, that was very surprising.
>
>>
>>> Due to the implementation of GappedAlignments, I can't set the seqlengths
>>> programmatically in GappedAlignments(), which I imagine would have reduced
>>> the first-iteration lag; see the trials below:
>>>
>>>     out <- GappedAlignments(seqlengths=seqlengths(chunk))
>>>
>>>     Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
>>>       'names(seqlengths)' incompatible with 'levels(seqnames)'
>>>
>>>     out <- GappedAlignments(seqlengths=seqlengths(chunk),
>>>                             seqnames=seqnames(chunk))
>>>
>>>     Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames =
>>>       seqnames(chunk)) :
>>>       'strand' must be specified when 'seqnames' is not empty
>>>
>>>     out <- GappedAlignments(seqlengths=seqlengths(chunk),
>>>                             seqnames=seqnames(chunk), strand="+")
>>>
>>>     Error in validObject(.Object) :
>>>       invalid class “GappedAlignments” object: 1: invalid object for slot
>>>       "strand" in class "GappedAlignments": got class "character", should
>>>       be or extend class "Rle"
>>>       invalid class “GappedAlignments” object: 2: number of rows in
>>>       DataTable 'mcols(x)' must match length of 'x'
>>
>> The trick is to create an empty GappedAlignments object
>> with non-empty seqlevels so you can put seqlengths on the
>> seqlevels.
>>
>> Here are 2 ways to create an empty GappedAlignments object with
>> non-empty seqlevels:
>>
>> (1) Pass an empty factor with non-empty levels to the seqnames
>>     arg:
>>
>>       out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
>>
>> (2) The recommended way:
>>
>>       out <- GappedAlignments()
>>       seqinfo(out) <- seqinfo(chunk)
>>
>> Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
>> its seqlengths), not only its seqlevels.
>>
>> (1) could be adapted to also set the seqlengths:
>>
>>       out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
>>                               seqlengths=seqlengths(chunk))
>>
>> but (2) is really the preferred way.
>
> Thanks for the pointers!
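When the complete object really is needed at the end, the usual way around the quadratic growth discussed above is to collect the chunks in a list and concatenate once after the loop. A sketch, assuming c() accepts several GappedAlignments objects at once (the same method the per-iteration append dispatches to):

    chunks <- list()
    while(length(chunk <- readBamGappedAlignments(bamFile))){
        chunks[[length(chunks) + 1L]] <- chunk
    }

    ## a single concatenation instead of one per iteration
    out <- if(length(chunks)) do.call(c, chunks) else GappedAlignments()

This still keeps every chunk in memory at once, so it only pays off when the final object itself is the goal; for a per-chunk reduction, the streaming approach sketched earlier avoids holding more than one chunk.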
>
>>
>>>
>>> I completely approve of such sanity checks; it seems that I'm just trying
>>> to do something that it was not designed for :-) All I'm really interested
>>> in is a way to stream my BAM file, and I'm looking forward to any
>>> suggestion. I especially don't want to re-invent the wheel if you have
>>> already planned something. If you haven't, I'd be glad to get some insight
>>> into how I can work around that problem.
>>>
>>> My sessionInfo:
>>>
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12
>>> [4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15
>>> [7] BiocGenerics_0.7.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0
>>
>> Looks like you are using Bioc-devel. Did you get all the
>> warnings about GappedAlignments, readBamGappedAlignments(),
>> and GappedAlignments() being deprecated?
>
> I thought I did, but indeed I didn't get the warnings then. This is very
> strange.
>
>> I thought you were using the release, so that's what I used:
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
>> [3] Biostrings_2.28.0                 GenomicRanges_1.12.4
>> [5] IRanges_1.18.1                    BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5   stats4_3.0.0   zlibbioc_1.6.0
>>
>> The timings I get with Bioc-devel are pretty much the same though.
>>
>> Something doesn't seem to be quite right with your cluster.
>
> I agree, I'll check that out.
>
>> What happens
>> if you try to rbind() 2 data.frames of 100000 rows each in a fresh
>> session?
>>
>>> df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
>>> system.time(df2 <- rbind(df, df))
>>    user  system elapsed
>>   0.204   0.000   0.206
>>
>
> Good point. I'll try that out and let you know.
>
> Thanks for the very detailed answer!
>
> Cheers,
>
> Nico
>
>
>> Thanks,
>> H.
>>
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delho...@embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>>
>>> _______________________________________________
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpa...@fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel