Re: [Bioc-sig-seq] Loading large BAM files
On 07/14/2011 08:47 AM, Ivan Gregoretti wrote: Hello Martin, I am currently reading in data with scanBam and making use of the ScanBamParam function. My bam data looks like this: names(bam1) [1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783" class(bam1) [1] "list" class(bam1$"chr1:1-197195432") [1] "list" class(bam1$"chr1:1-197195432"$qname) [1] "character" How do we consolidate all three lists (chr1, chr2, chr3) into a single list? I read the Rsamtools documentation but the examples do not quite apply to this case. These are lists-of-lists, and the somewhat opaque way of getting elements out is along the lines of unlist(lapply(bam1, "[[", "qname")) This paradigm seems not to work for XStringSet, for which do.call(c, unname(lapply(bam1, "[[", "seq"))) Martin Thank you, Ivan On Wed, Jul 13, 2011 at 5:25 PM, Ivan Gregoretti wrote: Hi Martin, I will have to play a little bit both with the counter function and with simplify2array. It's not evident for me what they are intended to do. I can answer your question about BAM size. Our samples are all from Illumina HiSeq now. So, we are talking about no less than 100 million good quality tags per lane. In different occasions I have run into the problem of the 2^31 -1 limit and I am always thinking about ways to process large amounts of information by chunks and in parallel. With time, sequencers will only output more tags so we will will run into this problems only more often. The suggestion of using ScanBamParam is good, thank you, but at this early stage of sample analysis I need to load the whole BAM. Some sample quality assessments are in order. Thank you, Ivan On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan wrote: On 07/13/2011 01:57 PM, Martin Morgan wrote: On 07/13/2011 01:36 PM, Ivan Gregoretti wrote: Hi everybody, As I wait for my large BAM to be read in by scanBAM, I can't help but to wonder: Has anybody tried combining scanBam with multicore to load the chromosomes in parallel? That would require 1) to merge the chunks at the end and 2) the original BAM to be indexed. Does anybody have any experience to share? Was wondering how large and long we're talking about? Use of ScanBamParam(what=...) can help. For some tasks I'd think of a coarser granularity, e.g., in the context of multiple bam files so that the data reduction (to a vector of 10,000's of counts) occurs on each core. counter = function(fl, genes) { aln = readGappedAlignments(fl) strand(aln) = "*" hits = countOverlaps(aln, genes) countOverlaps(genes, aln[hits==1]) } simplify2array(mclapply(bamFiles, counter, genes)) One issue I understand people have is that mclapply uses 'serialize()' to convert the return value of each function to a raw vector. raw vectors have the same total length limit as any other vector (2^31 -1 elements) and this places a limit on the size of chunk returned by each core. Also I believe that exceeding the limit can silently corrupt the data (i.e., a bug). This is second-hand information. Should also have mentioned that casting a GRanges object to RangesList provides the appropriate list to iterate over chromosomes, and that ScanBamParam will accept a which of class RangesList. Martin Martin Thank you, Ivan ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Re: [Bioc-sig-seq] Loading large BAM files
Hello Martin, I am currently reading in data with scanBam and making use of the ScanBamParam function. My bam data looks like this: > names(bam1) [1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783" > class(bam1) [1] "list" > class(bam1$"chr1:1-197195432") [1] "list" > class(bam1$"chr1:1-197195432"$qname) [1] "character" How do we consolidate all three lists (chr1, chr2, chr3) into a single list? I read the Rsamtools documentation but the examples do not quite apply to this case. Thank you, Ivan On Wed, Jul 13, 2011 at 5:25 PM, Ivan Gregoretti wrote: > Hi Martin, > > I will have to play a little bit both with the counter function and > with simplify2array. It's not evident for me what they are intended to > do. > > I can answer your question about BAM size. Our samples are all from > Illumina HiSeq now. So, we are talking about no less than 100 million > good quality tags per lane. > > In different occasions I have run into the problem of the 2^31 -1 > limit and I am always thinking about ways to process large amounts of > information by chunks and in parallel. With time, sequencers will only > output more tags so we will will run into this problems only more > often. > > The suggestion of using ScanBamParam is good, thank you, but at this > early stage of sample analysis I need to load the whole BAM. Some > sample quality assessments are in order. > > Thank you, > > Ivan > > > > > On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan wrote: >> On 07/13/2011 01:57 PM, Martin Morgan wrote: >>> >>> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote: Hi everybody, As I wait for my large BAM to be read in by scanBAM, I can't help but to wonder: Has anybody tried combining scanBam with multicore to load the chromosomes in parallel? That would require 1) to merge the chunks at the end and 2) the original BAM to be indexed. Does anybody have any experience to share? >>> >>> Was wondering how large and long we're talking about? >>> >>> Use of ScanBamParam(what=...) can help. >>> >>> For some tasks I'd think of a coarser granularity, e.g., in the context >>> of multiple bam files so that the data reduction (to a vector of >>> 10,000's of counts) occurs on each core. >>> >>> counter = function(fl, genes) { >>> aln = readGappedAlignments(fl) >>> strand(aln) = "*" >>> hits = countOverlaps(aln, genes) >>> countOverlaps(genes, aln[hits==1]) >>> } >>> simplify2array(mclapply(bamFiles, counter, genes)) >>> >>> One issue I understand people have is that mclapply uses 'serialize()' >>> to convert the return value of each function to a raw vector. raw >>> vectors have the same total length limit as any other vector (2^31 -1 >>> elements) and this places a limit on the size of chunk returned by each >>> core. Also I believe that exceeding the limit can silently corrupt the >>> data (i.e., a bug). This is second-hand information. >> >> Should also have mentioned that casting a GRanges object to RangesList >> provides the appropriate list to iterate over chromosomes, and that >> ScanBamParam will accept a which of class RangesList. >> >> Martin >> >>> >>> Martin >>> Thank you, Ivan ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>> >>> >> >> >> -- >> Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >> >> Location: M1-B861 >> Telephone: 206 667-2793 >> > ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Re: [Bioc-sig-seq] Loading large BAM files
Hi Martin, I will have to play a little bit both with the counter function and with simplify2array. It's not evident for me what they are intended to do. I can answer your question about BAM size. Our samples are all from Illumina HiSeq now. So, we are talking about no less than 100 million good quality tags per lane. In different occasions I have run into the problem of the 2^31 -1 limit and I am always thinking about ways to process large amounts of information by chunks and in parallel. With time, sequencers will only output more tags so we will will run into this problems only more often. The suggestion of using ScanBamParam is good, thank you, but at this early stage of sample analysis I need to load the whole BAM. Some sample quality assessments are in order. Thank you, Ivan On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan wrote: > On 07/13/2011 01:57 PM, Martin Morgan wrote: >> >> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote: >>> >>> Hi everybody, >>> >>> As I wait for my large BAM to be read in by scanBAM, I can't help but >>> to wonder: >>> >>> Has anybody tried combining scanBam with multicore to load the >>> chromosomes in parallel? >>> >>> That would require >>> >>> 1) to merge the chunks at the end and >>> >>> 2) the original BAM to be indexed. >>> >>> Does anybody have any experience to share? >> >> Was wondering how large and long we're talking about? >> >> Use of ScanBamParam(what=...) can help. >> >> For some tasks I'd think of a coarser granularity, e.g., in the context >> of multiple bam files so that the data reduction (to a vector of >> 10,000's of counts) occurs on each core. >> >> counter = function(fl, genes) { >> aln = readGappedAlignments(fl) >> strand(aln) = "*" >> hits = countOverlaps(aln, genes) >> countOverlaps(genes, aln[hits==1]) >> } >> simplify2array(mclapply(bamFiles, counter, genes)) >> >> One issue I understand people have is that mclapply uses 'serialize()' >> to convert the return value of each function to a raw vector. raw >> vectors have the same total length limit as any other vector (2^31 -1 >> elements) and this places a limit on the size of chunk returned by each >> core. Also I believe that exceeding the limit can silently corrupt the >> data (i.e., a bug). This is second-hand information. > > Should also have mentioned that casting a GRanges object to RangesList > provides the appropriate list to iterate over chromosomes, and that > ScanBamParam will accept a which of class RangesList. > > Martin > >> >> Martin >> >>> >>> Thank you, >>> >>> Ivan >>> >>> ___ >>> Bioc-sig-sequencing mailing list >>> Bioc-sig-sequencing@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Re: [Bioc-sig-seq] Loading large BAM files
On 07/13/2011 01:57 PM, Martin Morgan wrote: On 07/13/2011 01:36 PM, Ivan Gregoretti wrote: Hi everybody, As I wait for my large BAM to be read in by scanBAM, I can't help but to wonder: Has anybody tried combining scanBam with multicore to load the chromosomes in parallel? That would require 1) to merge the chunks at the end and 2) the original BAM to be indexed. Does anybody have any experience to share? Was wondering how large and long we're talking about? Use of ScanBamParam(what=...) can help. For some tasks I'd think of a coarser granularity, e.g., in the context of multiple bam files so that the data reduction (to a vector of 10,000's of counts) occurs on each core. counter = function(fl, genes) { aln = readGappedAlignments(fl) strand(aln) = "*" hits = countOverlaps(aln, genes) countOverlaps(genes, aln[hits==1]) } simplify2array(mclapply(bamFiles, counter, genes)) One issue I understand people have is that mclapply uses 'serialize()' to convert the return value of each function to a raw vector. raw vectors have the same total length limit as any other vector (2^31 -1 elements) and this places a limit on the size of chunk returned by each core. Also I believe that exceeding the limit can silently corrupt the data (i.e., a bug). This is second-hand information. Should also have mentioned that casting a GRanges object to RangesList provides the appropriate list to iterate over chromosomes, and that ScanBamParam will accept a which of class RangesList. Martin Martin Thank you, Ivan ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Re: [Bioc-sig-seq] Loading large BAM files
On 07/13/2011 01:36 PM, Ivan Gregoretti wrote: Hi everybody, As I wait for my large BAM to be read in by scanBAM, I can't help but to wonder: Has anybody tried combining scanBam with multicore to load the chromosomes in parallel? That would require 1) to merge the chunks at the end and 2) the original BAM to be indexed. Does anybody have any experience to share? Was wondering how large and long we're talking about? Use of ScanBamParam(what=...) can help. For some tasks I'd think of a coarser granularity, e.g., in the context of multiple bam files so that the data reduction (to a vector of 10,000's of counts) occurs on each core. counter = function(fl, genes) { aln = readGappedAlignments(fl) strand(aln) = "*" hits = countOverlaps(aln, genes) countOverlaps(genes, aln[hits==1]) } simplify2array(mclapply(bamFiles, counter, genes)) One issue I understand people have is that mclapply uses 'serialize()' to convert the return value of each function to a raw vector. raw vectors have the same total length limit as any other vector (2^31 -1 elements) and this places a limit on the size of chunk returned by each core. Also I believe that exceeding the limit can silently corrupt the data (i.e., a bug). This is second-hand information. Martin Thank you, Ivan ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Re: [Bioc-sig-seq] Loading large BAM files
Hi Kasper, I was monitoring the processes of the computer while scanBam was running and the process is clearly CPU limited. That's why I brought up the point. Ivan Ivan Gregoretti, PhD National Institute of Diabetes and Digestive and Kidney Diseases National Institutes of Health 5 Memorial Dr, Building 5, Room 205. Bethesda, MD 20892. USA. Phone: 1-301-496-1016 and 1-301-496-1592 Fax: 1-301-496-9878 On Wed, Jul 13, 2011 at 4:45 PM, Kasper Daniel Hansen wrote: > On Wed, Jul 13, 2011 at 4:36 PM, Ivan Gregoretti wrote: >> Hi everybody, >> >> As I wait for my large BAM to be read in by scanBAM, I can't help but to >> wonder: >> >> Has anybody tried combining scanBam with multicore to load the >> chromosomes in parallel? >> >> That would require >> >> 1) to merge the chunks at the end and >> >> 2) the original BAM to be indexed. >> >> Does anybody have any experience to share? > > Multicore will help if your task is cpu limited (and you can chunk > it). However, your specific task is probably I/O limited, in which > case it is not clear that it would help to have more cores; in fact it > might be detrimental depending on your filesystem and hardware. > > Kasper > ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
Re: [Bioc-sig-seq] Loading large BAM files
On Wed, Jul 13, 2011 at 4:36 PM, Ivan Gregoretti wrote: > Hi everybody, > > As I wait for my large BAM to be read in by scanBAM, I can't help but to > wonder: > > Has anybody tried combining scanBam with multicore to load the > chromosomes in parallel? > > That would require > > 1) to merge the chunks at the end and > > 2) the original BAM to be indexed. > > Does anybody have any experience to share? Multicore will help if your task is cpu limited (and you can chunk it). However, your specific task is probably I/O limited, in which case it is not clear that it would help to have more cores; in fact it might be detrimental depending on your filesystem and hardware. Kasper ___ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing