Hej Hervé! --------------------------------------------------------------- Nicolas Delhomme
Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delho...@embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 10, 2013, at 8:54 PM, Hervé Pagès wrote: > Hi Nico, > > On 07/09/2013 08:07 AM, Nicolas Delhomme wrote: >> Hej Bioc Core! >> >> There was some discussion last year about implementing a BamStreamer (à la >> FastqStreamer), but I haven't seen anything like it in the current devel. >> I've implemented the following function that should do the job for me - I >> have many very large files, and I need to use a cluster with relatively few >> RAM per node and a restrictive time allocation , so I want to parallelize >> the reading of the BAM file to manage both. The example below is obviously >> not affecting the RAM issue but I streamlined it to point out my issue. >> >> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){ >> >> ## create a stream >> stopifnot(is(bamFile,"BamFile")) >> >> ## set the yieldSize if it is not set already >> if(is.na(yieldSize(bamFile))){ >> yieldSize(bamFile) <- yieldSize >> } >> >> ## open it >> open(bamFile) >> >> ## verb >> if(verbose){ >> message(paste("Streaming",basename(path(bamFile)))) >> } >> >> ## create the output >> out <- GappedAlignments() >> >> ## process it >> while(length(chunk <- readBamGappedAlignments(bamFile))){ >> if(verbose){ >> message(paste("Processed",length(chunk),"reads")) >> } >> out <- c(out,chunk) >> } > > Note that regardless the speed of c() on GappedAlignments objects, > growing an object in a loop is fundamentally inefficient (see Circle 2 > of The R Inferno). > Also keeping the chunks in memory kind of defeats the purpose of reading > the file one chunk at a time. Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above. > >> >> ## close >> close(bamFile) >> >> ## return >> return(out) >> } >> >> In the method above, the first iteration of combining the GappedAlignments: >> >> out <- c(out,chunk) takes: >> >> system.time(append(out,chunk)) >> >> user system elapsed >> 123.704 0.060 124.011 > > 2 minutes! Whaoo, that's really slow. I can't reproduce this on my > machine though: > OK, sounds more like a system issue then. > library(Rsamtools) > library(RNAseqData.HNRNPC.bam.chr14) > bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]) > yieldSize(bamfile) <- 100000L > open(bamfile) > out <- GappedAlignments() > > Then: > > > chunk <- readBamGappedAlignments(bamfile) > > system.time(out <- append(out, chunk)) > user system elapsed > 0.284 0.000 0.286 > > I wonder what's going on on your system. Are you sure it was not running > out of memory when you did this? Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time. > Try to check the load with uptime or > top in another terminal (e.g. start top right before you call append()). > If the system starts swapping, then your R process will become hundreds > or thousands times slower! and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well. > >> >> whereas the second iteration (faked here) takes only (still long): >> >> system.time(append(chunk,chunk)) >> >> user system elapsed >> 2.708 0.044 2.758 > > 2nd, 3rd and 4th iterations for me: > > > chunk <- readBamGappedAlignments(bamfile) > > system.time(out <- append(out, chunk)) > user system elapsed > 0.516 0.004 0.521 > > > chunk <- readBamGappedAlignments(bamfile) > > system.time(out <- append(out, chunk)) > user system elapsed > 0.656 0.008 0.663 > > > chunk <- readBamGappedAlignments(bamfile) > > system.time(out <- append(out, chunk)) > user system elapsed > 0.796 0.004 0.801 > > As expected, the time is growing (this is why the process > of growing an object in a loop is considered to be quadratic > in time). Quadratic! Wow, I knew it was slower but still... Good to know. > >> >> I suppose this has to do with the way >> GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the >> objects and all the related sanity checks. For the first iteration, the >> seqlengths are different so I suppose that is what explains the 60X lag >> compared to the second iteration. > > The seqinfo of the 2 objects to combine need to be merged together > and set back on each object before the 2 objects can actually > be combined. This operation is cheap and I wouldn't expect this > to slow down the first iteration significantly. Yes, that was very surprising. > >> Due to the implementation of GappedAlignments, I can't set the seqlengths >> programmatically in GappedAlignments() which I imagine would have reduced >> the first iteration lag; see the trials below: >> >> out <- GappedAlignments(seqlengths=seqlengths(chunk)) >> >> Error in GappedAlignments(seqlengths = seqlengths(chunk)) : >> 'names(seqlengths)' incompatible with 'levels(seqnames)' >> >> out <- >> GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) >> >> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = >> seqnames(chunk)) : >> 'strand' must be specified when 'seqnames' is not empty >> >> out <- >> GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") >> >> Error in validObject(.Object) : >> invalid class “GappedAlignments” object: 1: invalid object for slot >> "strand" in class "GappedAlignments": got class "character", should be or >> extend class "Rle" >> invalid class “GappedAlignments” object: 2: number of rows in DataTable >> 'mcols(x)' must match length of 'x' > > The trick is to create an empty GappedAlignments objects > with non-empty seqlevels so you can put seqlengths on the > seqlevels. > > Here are 2 ways to create an empty GappedAlignments objects with > non-empty seqlevels: > > (1) Pass an empty factor with non-empty levels to the seqnames > arg: > > out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk))) > > (2) The recommended way: > > out <- GappedAlignments() > seqinfo(out) <- seqinfo(chunk) > > Note that with (2), 'out' gets all the seqinfo from 'chunk' (including > its seqlengths), not only its seqlevels. > > (1) could be adapted to also set the seqlengths: > > out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)), > seqlengths=seqlengths(chunk)) > > but (2) is really the preferred way. Thanks for the pointers! > >> >> I completely approve of such sanity checks; it seems that I'm just trying to >> do something that it was not designed for :-) All I'm really interested in >> is a way to stream my BAM file and I'm looking forward to any suggestion. I >> especially don't want to re-invent the wheel if you have already planned >> something. If you haven't I'd be glad to get some insight how I can walk >> around that problem. >> >> My sessionInfo: >> >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 >> [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 >> [7] BiocGenerics_0.7.2 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0 > > > Looks like you are using Bioc-devel. Did you get all the > warnings about GappedAlignments, readBamGappedAlignments(), > and GappedAlignments() being deprecated? I though I did, but indeed I didn't get the warnings then. This is very strange. > > I thought you were using the release so that's what I used: > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3 > [3] Biostrings_2.28.0 GenomicRanges_1.12.4 > [5] IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0 > > > The timings I get with Bioc-devel are pretty much the same though. > > Something doesn't seem to be quite right with your cluster. I agree, I'll check that out. > What happens > if you try to rbind() 2 data.frames of 100000 rows each in a fresh > session? > > > df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd") > > system.time(df2 <- rbind(df, df)) > user system elapsed > 0.204 0.000 0.206 > Good point. I'll try that out and let you know. Thanks for the very detailed answer! Cheers, Nico > Thanks, > H. > >> >> Cheers, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delho...@embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel