[Biohaskell] Performance with samtools

Ketil Malde Tue, 16 Aug 2011 14:07:21 -0700

Hi Nick and others,

I'm trying to extract all "properly paired" reads from a BAM file,
generating a pair of fastq-files, ordered by pairs, much like you'd get
from an Illumina PE run.
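
To make the selection concrete, here is a minimal sketch of the idea. It is not the code in the repo. The Rec type and the names mateOf and toFastq are hypothetical stand-ins (the real program works on Bam1 values); only the flag bits come from the SAM format (0x2 = properly paired, 0x40 = first in pair, 0x80 = second in pair). Matching up the two mates by name, and reverse-complementing reads that mapped to the reverse strand, are left out.

```haskell
import Data.Bits (testBit)
import qualified Data.ByteString.Char8 as B

-- Hypothetical stand-in for an alignment record (not the Bam1 type).
data Rec = Rec
  { recFlag :: !Int           -- SAM FLAG field
  , recName :: !B.ByteString  -- read name
  , recSeq  :: !B.ByteString  -- bases
  , recQual :: !B.ByteString  -- qualities, assumed already in FASTQ text encoding
  }

data Mate = First | Second deriving (Eq, Show)

-- Decide which output file (if any) a record belongs to.
-- Bit 1 = 0x2 (properly paired), bit 6 = 0x40 (first), bit 7 = 0x80 (second).
mateOf :: Rec -> Maybe Mate
mateOf r
  | not (testBit f 1) = Nothing
  | testBit f 6       = Just First
  | testBit f 7       = Just Second
  | otherwise         = Nothing
  where
    f = recFlag r

-- Render one record as a four-line FASTQ entry.
toFastq :: Rec -> B.ByteString
toFastq r =
  B.unlines [B.cons '@' (recName r), recSeq r, B.pack "+", recQual r]
```

In GHCi, mateOf applied to a record with flag 99 (1+2+32+64) gives Just First, and flag 147 (1+2+16+128) gives Just Second; a record without the 0x2 bit is dropped.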


The code is in the darcs repo at
http://malde.org/~ketil/biohaskell/bamselect, should anybody want to
take a look.  (The last three revisions or so are refactorings, trying
to get to the bottom of this, although nothing much has improved.)

The problem is that it is slow, and according to my profiling efforts,
GC takes up the majority of the time, some 80-90%.  I was wondering whether
I'm doing something stupid, or if there is some way of remedying this.

It seems that the Bam1 accessor functions (queryName, querySeq, and
queryQual), currently called in the 'extract' function, construct new
bytestrings, and that this accounts for the majority of the
allocations.
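
As an illustration of what "constructing a new bytestring" can cost, the sketch below contrasts copying bytes out of a foreign buffer with aliasing them. This assumes the accessors copy out of the C-side record, which has not been checked against the binding's source; the name copyVersusShare is hypothetical, and the buffer is only a stand-in for a field in such a record. The two packing functions are the ones from the bytestring package.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Unsafe as BU

-- Build a ByteString from a foreign buffer in two ways and compare them.
-- Returns the copied string and whether both versions had equal contents.
copyVersusShare :: B.ByteString -> IO (B.ByteString, Bool)
copyVersusShare src =
  B.useAsCStringLen src $ \buf -> do
    -- Allocates a fresh buffer and copies the bytes: O(n) per call.
    copied <- B.packCStringLen buf
    -- O(1): wraps the existing pointer, no copy. The result is only
    -- valid while 'buf' is alive, so it must not escape this scope.
    shared <- BU.unsafePackCStringLen buf
    let same = copied == shared
    -- Force the comparison before the buffer goes away.
    same `seq` return (copied, same)
```

The copying variant is the safe default, since the result outlives the buffer. The aliasing variant avoids the allocation but is only sound if the underlying record is guaranteed to stay alive and unmodified for as long as the string is used.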

Profile excerpt:

  COST CENTRE                    MODULE               %time %alloc

  extract                        Main                  58.0   80.5
  hPutFq                         Main                  16.6   13.3
  main                           Main                  15.8    4.2
  splitBam                       Main                   8.4    1.8
  decide                         Main                   1.2    0.3

Any ideas most welcome!

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants