Hi Nick and others,

I'm trying to extract all "properly paired" reads from a BAM file, generating a pair of FASTQ files ordered by pairs, much like you'd get from an Illumina PE run.
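
A minimal sketch of the selection criterion, assuming "properly paired" means that SAM flag bit 0x2 is set, with bits 0x40 and 0x80 marking the first and second mate. The names `Mate` and `classify` are hypothetical; they are not taken from bamselect or from the Bam1 API.

```haskell
module Main (main) where

import Data.Bits (testBit)
import Data.Word (Word16)

-- Which of the two FASTQ files a read would go to (hypothetical type).
data Mate = First | Second
  deriving (Eq, Show)

-- Decide from the SAM flag field alone (assumed bit layout, per the SAM spec):
--   bit 1 (0x2)  = read mapped in a proper pair
--   bit 6 (0x40) = first read of the pair
--   bit 7 (0x80) = second read of the pair
-- Reads that are not properly paired yield Nothing.
classify :: Word16 -> Maybe Mate
classify flag
  | not (testBit flag 1) = Nothing
  | testBit flag 6       = Just First
  | testBit flag 7       = Just Second
  | otherwise            = Nothing

main :: IO ()
main = mapM_ (print . classify) [99, 147, 73, 4]
```

A check like this only reads an integer field, so it should not need to build any bytestrings; name, sequence and quality would only have to be extracted for the reads that pass.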

The code is in the darcs repo at http://malde.org/~ketil/biohaskell/bamselect, should anybody want to take a look. (The last three revisions or so are refactorings made while trying to get to the bottom of this, although nothing much has improved.)

The problem is that it is slow, and according to my profiling efforts, GC takes up the majority of the time: 80-90%. I was wondering whether I'm doing something stupid, or whether there is some way of remedying this.

It seems that the Bam1 accessor functions (queryName, querySeq, and queryQual), currently called in the 'extract' function, construct new bytestrings, and that this is the cause of most of the allocations.

Profile excerpt:

COST CENTRE  MODULE  %time  %alloc
extract      Main     58.0    80.5
hPutFq       Main     16.6    13.3
main         Main     15.8     4.2
splitBam     Main      8.4     1.8
decide       Main      1.2     0.3

Any ideas most welcome!

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants