Hi Martin, On Jul 10, 2013, at 8:40 PM, Martin Morgan wrote:
> > ----- Nicolas Delhomme <delho...@embl.de> wrote: >> Hej Bioc Core! >> >> There was some discussion last year about implementing a BamStreamer (à la >> FastqStreamer), but I haven't seen anything like it in the current devel. >> I've implemented the following function that should do the job for me - I >> have many very large files, and I need to use a cluster with relatively few >> RAM per node and a restrictive time allocation , so I want to parallelize >> the reading of the BAM file to manage both. The example below is obviously >> not affecting the RAM issue but I streamlined it to point out my issue. > > Hi Nico -- > > I had viewed the 'BamStreamer' functionality to be implemented by > > bf <- open(BamFile("foo.bam", yieldSize=1234567)) > while (length(x <- readGAlignments(bf))) {} > close(bf) > Good to see I recalled your Cambridge teaching correctly ;-) > paradigm, without wanting to wrap it further (FastqStreamer isn't any more > convenient) because the tough work will be done in {} and one would have to > create some sort of complicated rule about function signatures for > implementing this as a call-back. I usually implement '{}' as a reduction > where all the work is done (as with summarizeOverlaps, where the bam data is > reduced to a count vector and then summed across iterations through the loop) > -- it doesn't seem like there's any memory management gains to be had by > concatenating GappedAlignments (these are renamed GAlignments in devel) > together; I agree and that's what I meant by having created a streamlined version of my function so has to show the paradigm, not the actual data reduction steps. > usually I'd parallelize over files > > bfl <- open(BamFileList(fls, yieldSize=1234567))) > result <- mclapply(bfl, function(bf, anno, ...) { > ## initialize, e.g., cnt <- integer(length(anno))) > while(length(x <- readGAlignments(bf))) {} > ## return, e.g., anno > }) > close(bfl) So do I. > > This doesn't address the efficiency of appending GAlignments. This was really a side effect that appeared when I was preparing the code example. I'll go through Hervé's answer about this, they might indeed have been an issue with R version (devel or not), as I didn't get the GAlignments warnings although the package version numbers seemed to match what's in devel. I'll have a closer look. Thanks, Nico > > Martin > >> >> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){ >> >> ## create a stream >> stopifnot(is(bamFile,"BamFile")) >> >> ## set the yieldSize if it is not set already >> if(is.na(yieldSize(bamFile))){ >> yieldSize(bamFile) <- yieldSize >> } >> >> ## open it >> open(bamFile) >> >> ## verb >> if(verbose){ >> message(paste("Streaming",basename(path(bamFile)))) >> } >> >> ## create the output >> out <- GappedAlignments() >> >> ## process it >> while(length(chunk <- readBamGappedAlignments(bamFile))){ >> if(verbose){ >> message(paste("Processed",length(chunk),"reads")) >> } >> out <- c(out,chunk) >> } >> >> ## close >> close(bamFile) >> >> ## return >> return(out) >> } >> >> In the method above, the first iteration of combining the GappedAlignments: >> >> out <- c(out,chunk) takes: >> >> system.time(append(out,chunk)) >> >> user system elapsed >> 123.704 0.060 124.011 >> >> whereas the second iteration (faked here) takes only (still long): >> >> system.time(append(chunk,chunk)) >> >> user system elapsed >> 2.708 0.044 2.758 >> >> I suppose this has to do with the way >> GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the >> objects and all the related sanity checks. For the first iteration, the >> seqlengths are different so I suppose that is what explains the 60X lag >> compared to the second iteration. Due to the implementation of >> GappedAlignments, I can't set the seqlengths programmatically in >> GappedAlignments() which I imagine would have reduced the first iteration >> lag; see the trials below: >> >> out <- GappedAlignments(seqlengths=seqlengths(chunk)) >> >> Error in GappedAlignments(seqlengths = seqlengths(chunk)) : >> 'names(seqlengths)' incompatible with 'levels(seqnames)' >> >> out <- >> GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) >> >> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = >> seqnames(chunk)) : >> 'strand' must be specified when 'seqnames' is not empty >> >> out <- >> GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") >> >> Error in validObject(.Object) : >> invalid class “GappedAlignments” object: 1: invalid object for slot "strand" >> in class "GappedAlignments": got class "character", should be or extend >> class "Rle" >> invalid class “GappedAlignments” object: 2: number of rows in DataTable >> 'mcols(x)' must match length of 'x' >> >> I completely approve of such sanity checks; it seems that I'm just trying to >> do something that it was not designed for :-) All I'm really interested in >> is a way to stream my BAM file and I'm looking forward to any suggestion. I >> especially don't want to re-invent the wheel if you have already planned >> something. If you haven't I'd be glad to get some insight how I can walk >> around that problem. >> >> My sessionInfo: >> >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 >> [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 >> [7] BiocGenerics_0.7.2 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0 >> >> Cheers, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delho...@embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel > _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel