On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
> Hi everybody,
>
> As I wait for my large BAM to be read in by scanBam, I can't help but
> wonder:
>
> Has anybody tried combining scanBam with multicore to load the
> chromosomes in parallel?
>
> That would require
>
> 1) merging the chunks at the end, and
>
> 2) the original BAM to be indexed.
>
> Does anybody have any experience to share?

I was wondering how large a file (and how long a wait) we're talking about?

Use of ScanBamParam(what=...) to read in only the fields you actually need can help.
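
For the per-chromosome idea, something along these lines might work (a minimal sketch, untested; 'aln.bam' and the choice of 'what' fields are placeholders, and it assumes a coordinate-sorted BAM with its index alongside):

  library(Rsamtools)       # scanBamHeader(), ScanBamParam(), scanBam()
  library(GenomicRanges)   # GRanges()
  library(multicore)       # mclapply()

  fl = "aln.bam"           # hypothetical indexed BAM
  targets = scanBamHeader(fl)[[1]][["targets"]]
  which = GRanges(names(targets), IRanges(1L, targets))

  ## one scanBam() job per chromosome; the index makes each query cheap
  chunks = mclapply(seq_along(which), function(i) {
      param = ScanBamParam(which=which[i],
                           what=c("rname", "pos", "qwidth"))
      scanBam(fl, param=param)[[1]]
  })

  ## merge the per-chromosome pieces, e.g., the start positions
  pos = do.call(c, lapply(chunks, "[[", "pos"))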

For some tasks I'd think of a coarser granularity, e.g., parallelizing across multiple BAM files, so that the data reduction (down to a vector of tens of thousands of counts) occurs on each core:

  library(multicore)       # mclapply()
  library(Rsamtools)       # BAM input used by readGappedAlignments()
  library(GenomicRanges)   # readGappedAlignments(), countOverlaps()

  counter = function(fl, genes) {
      aln = readGappedAlignments(fl)      # one whole BAM per worker
      strand(aln) = "*"                   # discard strand information
      hits = countOverlaps(aln, genes)    # genes hit by each read
      ## per-gene counts, using only reads hitting exactly one gene
      countOverlaps(genes, aln[hits == 1])
  }
  simplify2array(mclapply(bamFiles, counter, genes))
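
If, say, bamFiles names a dozen BAMs and you pass mc.cores= to mclapply, the result is a genes-by-files count matrix, with each file's reduction done entirely on its own core before anything is returned.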

One issue I understand people have is that mclapply uses serialize() to convert the return value of each function to a raw vector. Raw vectors have the same total length limit as any other R vector (2^31 - 1 elements), and this caps the size of the chunk each core can return. I also believe that exceeding the limit can silently corrupt the data (i.e., a bug), though this is second-hand information.
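
If the per-worker result really is big, one workaround (a sketch with hypothetical names, in the spirit of the reduction above) is to keep the bulky object out of the return value: save it to disk on the worker and return only the small reduction plus a file path, so that what crosses the serialize() boundary stays far below the limit.

  countAndSave = function(fl, genes) {
      aln = readGappedAlignments(fl)
      out = tempfile(fileext=".rda")
      save(aln, file=out)                     # big object stays on disk
      list(counts=countOverlaps(genes, aln),  # small per-gene vector
           alnFile=out)                       # a path, not the data
  }
  res = mclapply(bamFiles, countAndSave, genes)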

Martin


> Thank you,
>
> Ivan



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
