Hi Ole,

The number of lines (reads) in read.ids is ~9 million. The number of alignment lines in the SAM/BAM file is 372,281,262.
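
For anyone following the thread, here is a minimal sketch of the chunk-per-CPU idea from the suggestion quoted below, run on toy data. It is not the exact GNU parallel invocation: it assumes GNU coreutils `split` (for `-n l/N`) and uses plain background jobs instead of `parallel`. The file `aln.sam` is a made-up stand-in for the output of `samtools view in.bam`, and the read IDs and chunk count are hypothetical.

```shell
#!/bin/sh
# Toy stand-in for the real data; all file names and IDs here are made up.
set -eu
work=$(mktemp -d)
cd "$work"

# Fake "samtools view" output: read ID in column 1, tab-separated.
printf 'r1\t0\tchr1\nr2\t16\tchr1\nr3\t0\tchr2\nr10\t0\tchr2\nr2\t0\tchr3\n' > aln.sam

# IDs to keep, one per line.
printf 'r1\nr2\nr10\n' > read.ids

# Split the ID list into 2 chunks without breaking lines (GNU split only).
split -n l/2 read.ids id.

# One fixed-string, whole-word grep per chunk, all running concurrently.
# Each job scans the full alignment file against its own share of the IDs.
for f in id.*; do
    grep -F -w -f "$f" aln.sam > "hits.$f" &
done
wait

# Collect the per-chunk hits into one file and report how many lines matched.
cat hits.id.* > alignments.txt
wc -l < alignments.txt
```

Note that the combined output is grouped by ID chunk rather than kept in the original alignment order, and that every job reads the whole alignment stream, so on real data the BAM would be decoded once per chunk.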
Cheers,
Nathan

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Ole Tange
Sent: Saturday, 10 August 2013 7:05 AM
To: Nathan S. Watson-Haigh
Cc: [email protected]
Subject: Re: Parallelising grep

On Fri, Aug 9, 2013 at 7:53 AM, Nathan S. Watson-Haigh <[email protected]> wrote:
>
> I have a SAM/BAM file and I'd like to grep for alignments of certain
> read IDs. I have the read ID strings in another file. I'm currently
> doing this with:
>
> $ samtools view in.bam | fgrep -w -f read.ids > alignments.txt

It will help if we get some idea of the size of the bam and ids, so give the output for:

$ samtools view in.bam | wc
$ wc read.ids
$ samtools view in.bam | fgrep -w -f read.ids | wc

Based on no information I would split the ids into a chunk per CPU:

$ parallel --round-robin --pipe --block 1k cat ">"id.{#} < read.ids

And then run one per CPU:

$ parallel "samtools view in.bam | fgrep -w -f {}" ::: id.* > alignments.txt

/Ole

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________
