On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
> Hi everybody,
>
> As I wait for my large BAM to be read in by scanBam, I can't help but
> wonder:
>
> Has anybody tried combining scanBam with multicore to load the
> chromosomes in parallel?
>
> That would require
>
> 1) merging the chunks at the end, and
>
> 2) the original BAM to be indexed.
>
> Does anybody have any experience to share?

I was wondering how large a file (and how long a wait) we're talking about?

Use of ScanBamParam(what=...) to read in only the fields you actually need can help.
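
For the per-chromosome idea, something along these lines might work (a minimal sketch, untested; 'aln.bam' and the choice of 'what' fields are placeholders, and it assumes a coordinate-sorted BAM with its index alongside):

  library(Rsamtools)       # scanBamHeader(), ScanBamParam(), scanBam()
  library(GenomicRanges)   # GRanges()
  library(multicore)       # mclapply()

  fl = "aln.bam"           # hypothetical indexed BAM
  targets = scanBamHeader(fl)[[1]][["targets"]]
  which = GRanges(names(targets), IRanges(1L, targets))

  ## one scanBam() job per chromosome; the index makes each query cheap
  chunks = mclapply(seq_along(which), function(i) {
      param = ScanBamParam(which=which[i],
                           what=c("rname", "pos", "qwidth"))
      scanBam(fl, param=param)[[1]]
  })

  ## merge the per-chromosome pieces, e.g., the start positions
  pos = do.call(c, lapply(chunks, "[[", "pos"))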

For some tasks I'd think of a coarser granularity, e.g., parallelizing across multiple BAM files, so that the data reduction (down to a vector of tens of thousands of counts) occurs on each core:

  library(multicore)       # mclapply()
  library(Rsamtools)       # BAM input used by readGappedAlignments()
  library(GenomicRanges)   # readGappedAlignments(), countOverlaps()

  counter = function(fl, genes) {
      aln = readGappedAlignments(fl)      # one whole BAM per worker
      strand(aln) = "*"                   # discard strand information
      hits = countOverlaps(aln, genes)    # genes hit by each read
      ## per-gene counts, using only reads hitting exactly one gene
      countOverlaps(genes, aln[hits == 1])
  }
  simplify2array(mclapply(bamFiles, counter, genes))
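
If, say, bamFiles names a dozen BAMs and you pass mc.cores= to mclapply, the result is a genes-by-files count matrix, with each file's reduction done entirely on its own core before anything is returned.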

One issue I understand people have is that mclapply uses serialize() to convert the return value of each function to a raw vector. Raw vectors have the same total length limit as any other R vector (2^31 - 1 elements), and this caps the size of the chunk each core can return. I also believe that exceeding the limit can silently corrupt the data (i.e., a bug), though this is second-hand information.
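
If the per-worker result really is big, one workaround (a sketch with hypothetical names, in the spirit of the reduction above) is to keep the bulky object out of the return value: save it to disk on the worker and return only the small reduction plus a file path, so that what crosses the serialize() boundary stays far below the limit.

  countAndSave = function(fl, genes) {
      aln = readGappedAlignments(fl)
      out = tempfile(fileext=".rda")
      save(aln, file=out)                     # big object stays on disk
      list(counts=countOverlaps(genes, aln),  # small per-gene vector
           alnFile=out)                       # a path, not the data
  }
  res = mclapply(bamFiles, countAndSave, genes)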

Martin


> Thank you,
>
> Ivan



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
