This has come up several times I think, and I'm totally naive, but what about making R smart enough to do that itself? One nice thing about the vectorization is that R knows how many CHARSXPs it needs to allocate up-front, or at least an upper-bound when they're all unique. Is the issue that the R heap needs to be initialized in a special way and its behavior can't change at run-time? It's a tricky thing to get right, I'm sure.
On Fri, Aug 2, 2013 at 4:24 AM, Martin Morgan <mtmor...@fhcrc.org> wrote: > On 07/31/2013 04:05 PM, Hervé Pagès wrote: > >> Hi Dario, >> >> Thanks for providing the file. >> >> The file has 55780092 records in it and yes it only contains primary >> alignments so this is the best situation for the pairing algorithm. >> >> Here are the timings I get for the readBamGappedAlignmentPairs() >> function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit >> Ubuntu server with 384 Gb of RAM: >> >> ## With BioC release (Rsamtools 1.12.3) >> ## ------------------------------**------ >> ## Timings: >> ## - readBamGappedAlignments: 10 min 30 s (A=A1+A2) >> ## - scanBam: 6 min (A1) >> > > I haven't looked at Dario's data directly, but a surprising bottleneck is > the creation of large character vectors > > $ R --vanilla > [...] > > i = seq_len(55780092/2) > > system.time(as.character(i)) > user system elapsed > 74.880 0.408 75.451 > > This can be alleviated by telling R that that you're likely to need lots > of memory up front; I alias R so that > > $ R --min-vsize=2048M --min-nsize=20M --vanilla > [...] > > i = seq_len(55780092/2) > > system.time(as.character(i)) > user system elapsed > 25.269 0.472 25.796 > > or > > $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla > [...] > > i = seq_len(55780092/2) > > system.time(as.character(i)) > user system elapsed > 12.196 0.464 12.687 > > These are documented on ?Memory and with > > > R.version.string > [1] "R version 3.0.1 Patched (2013-06-05 r62876)" > > > Martin > > > ## - other: 4 min 30 s (A2) >> ## - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2) >> ## - findMateAlignment: 1 min 30 s (B1) >> ## - other: 4 min (B2) >> ## Total time: 16 min (A+B) >> ## Memory usage: 13 Gb >> >> ## With BioC devel (Rsamtools 1.13.19) >> ## ------------------------------**----- >> ## Timings: >> ## - readGAlignmentsFromBam: 11 min (A=A1+A2) >> ## - scanBam: 6 min (A1) >> ## - other: 5 min (A2) >> ## - makeGAlignmentPairs: 4 min (B=B1+B2) >> ## - findMateAlignment: 1 min (B1) >> ## - other: 3 min (B2) >> ## Total time: 15 min (A+B) >> ## Memory usage: 13 Gb >> >> Indeed, not much difference between release and devel :-/ >> >> I didn't have the opportunity to do this kind of detailed timing of >> the readGAlignmentPairsFromBam() function on such a big file before, >> but I find it pretty interesting (I only benchmarked findMateAlignment() >> and it was on simulated data). >> >> In particular I'm quite surprised to discover that the pairing algo >> (implemented in findMateAlignment) is actually not the bottleneck >> (and it being optimized in BioC devel explains the small difference >> between total time in release and total time in devel). >> >> Just a note on the timings you sent earlier (I'm copying them here >> again): >> >> > system.time(RNAreads <- readBamGappedAlignmentPairs(**file)) >> user system elapsed >> 1852.16 59.00 1939.11 >> > system.time(RNAreads <- readBamGappedAlignments(file)) >> user system elapsed >> 192.80 8.12 214.97 >> >> Subtracting the 2 times above to infer the time spent for the pairing >> is a shortcut that overlooks the following: when >> readBamGappedAlignmentPairs() calls readBamGappedAlignments() >> internally, it calls it with 'use.names=TRUE', and also with a 'param' >> arg that specifies the extra BAM fields, and also with a specific flag >> filter. All those things are required for the pairing algo. So a more >> sensible comparison is to compare: >> >> RNAreads <- readBamGappedAlignmentPairs(**file) >> >> versus: >> >> flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE, >> hasUnmappedMate=FALSE) >> what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos") >> param0 <- ScanBamParam(flag=flag0, what=what0) >> RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0) >> >> This internal call to readBamGappedAlignments() takes more than 70% >> of the total time of readBamGappedAlignmentPairs() (11 min / 15 min >> with BioC devel on my Linux server). This is a little bit surprising >> to me and that's something I'd like to investigate. >> >> Cheers, >> H. >> >> >> On 07/30/2013 05:00 PM, Dario Strbenac wrote: >> >>> Hi. >>> >>> The files are >>> >>> http://www.maths.usyd.edu.au/**u/dario/RNAAlignedSorted.bam<http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam> >>> http://www.maths.usyd.edu.au/**u/dario/RNAAlignedSorted.bam.**bai<http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai> >>> >>> While you have them, could you also investigate the granges coercion >>> function >>> ? It also took half an hour. >>> >>> ______________________________**__________ >>> From: Hervé Pagès [hpa...@fhcrc.org] >>> Sent: Tuesday, 30 July 2013 3:29 AM >>> To: Dario Strbenac >>> Cc: bioc-devel@r-project.org >>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs >>> >>> On 07/29/2013 12:00 AM, Dario Strbenac wrote: >>> >>>> Because I only allowed unique mappings, the ratio is 2. After >>>> installing the >>>> development package versions, I only got a 10% improvement. >>>> >>> >>> Mmmh that's disappointing. Can you put the file somewhere so I can have >>> a look? Thanks. >>> >>> H. >>> >>> >>>> user system elapsed >>>> 1681.36 65.79 1752.10 >>>> >>>> ______________________________**__________ >>>> From: Pages, Herve [hpa...@fhcrc.org] >>>> Sent: Monday, 29 July 2013 3:29 PM >>>> To: Dario Strbenac >>>> Cc: bioc-devel@r-project.org >>>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs >>>> >>>> Hi Dario, >>>> >>>> The function was optimized in Bioc-devel. Depending on the number >>>> of records in your BAM file (which is more relevant than its size), >>>> it should now be between 2x and 5x faster. Please give it a try and >>>> let us know. >>>> >>>> Also keep in mind that the pairing algo will slow down if >>>> the "average nb of records per unique QNAME" is 3 or more. >>>> This was discussed here last month: >>>> https://stat.ethz.ch/**pipermail/bioconductor/2013-** >>>> June/053390.html<https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html> >>>> >>>> Cheers, >>>> H. >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au> >>>> To: bioc-devel@r-project.org >>>> Sent: Sunday, July 28, 2013 9:00:30 PM >>>> Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs >>>> >>>> Hello, >>>> >>>> Could readBamGappedAlignmentPairs be optimised ? It's surprising that it >>>> takes an extra 29 minutes to calculate the pairing information, for the >>>> example below. The BAM file is 4 GB in size. >>>> >>>> system.time(RNAreads <- readBamGappedAlignmentPairs(**file)) >>>>> >>>> user system elapsed >>>> 1852.16 59.00 1939.11 >>>> >>>>> system.time(RNAreads <- readBamGappedAlignments(file)) >>>>> >>>> user system elapsed >>>> 192.80 8.12 214.97 >>>> >>>> ------------------------------**-------- >>>> Dario Strbenac >>>> PhD Student >>>> University of Sydney >>>> Camperdown NSW 2050 >>>> Australia >>>> >>>> ______________________________**_________________ >>>> Bioc-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>> >>>> >>>> ______________________________**_________________ >>>> Bioc-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>> >>>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpa...@fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >>> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > > ______________________________**_________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> > [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel