Dear all,
I am trying to speed up a very long procedure that I need to run on multiple
files, and I thought that I could run the jobs for different files in separate
threads across multiple CPUs. For some reason that I don't really understand,
I only achieve a very small time gain. I have included my script, which
essentially repeats the same function, extractSeq(), on multiple files using a
maximum of four threads. I would really appreciate it if I could finally
understand how to use threads to speed up some of my lengthy scripts.
Thanks
Marco
#!/usr/local/bin/perl -w
use strict;
use Bio::SeqIO;
use threads;
use Getopt::Std;
our $opt_p;
init();
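my @thr;
for my $i (0 .. $#ARGV) {
    # Start one thread per input file.
    push @thr, threads->new(\&extractSeq, $ARGV[$i]);
    # Once $opt_p threads are running (or the last file has been queued),
    # wait for the whole batch to finish before starting the next one.
    if (scalar(@thr) == $opt_p || $i == $#ARGV) {
        print "Running ", scalar(@thr), " parallel jobs\n";
        $_->join for @thr;
        @thr = ();
    }
}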
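sub extractSeq {
    my $file = shift;
    # Split the path into directory, base name and extension. Only the base
    # name is reused, so the output file lands in the current directory.
    my ($dir, $pre, $suf) = ($file =~ /(^.+\/|^)(.+)\.(.+$)/);
    my $out_name = "${pre}_CleanSeq.$suf";
    my $seqin = Bio::SeqIO->new(-file   => $file,
                                -format => 'fasta');
    my $seq_out = Bio::SeqIO->new(-file   => ">$out_name",
                                  -format => 'fasta');
    while (my $seq = $seqin->next_seq) {
        if ($seq->seq =~ /AGATC/) {
            # $-[0] is the 0-based offset of the match, so positions
            # 1 .. $-[0]+5 cover everything up to and including the
            # 5 nt of B primer that are kept.
            my $end = $-[0] + 5;
            $seq->seq($seq->subseq(1, $end));
            $seq_out->write_seq($seq);
        }
    }
    return(0);
}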
sub init {
    getopts("p:");
    unless (@ARGV) {
        print("extractseq.pl [-p 4] seq_1.fa [seq_2.fa ...]\n\n",
              "Take the sequences from the Solexa sequences in Fasta format and\n",
              "\t1)Find the B primer\n",
              "\t2)Extract the sequences before the B primer leaving 5 nt of B primer\n\n",
              "-p\tNumber of processors to be used to process the files when more than ",
              "one file is passed to the command line\n",
              "\tDefault 4\n\n");
        exit(1);
    }
    $opt_p = 4 unless $opt_p;
    return(0);
}
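
For reference, the two pure-Perl steps inside extractSeq() (building the output
name and trimming at the primer) can be tried without BioPerl or threads. This
is only a minimal sketch: trim_at_primer() and clean_name() are hypothetical
helper names that do not appear in the script above, and plain strings stand in
for the Bio::Seq objects.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): keep everything up to
# and including the 5 nt of B primer; return undef when no primer is found.
sub trim_at_primer {
    my ($sequence) = @_;
    return undef unless $sequence =~ /AGATC/;
    # $-[0] is the 0-based offset where the match starts, so the kept prefix
    # has length $-[0] + 5 (the same span as subseq(1, $-[0] + 5)).
    return substr($sequence, 0, $-[0] + 5);
}

# Hypothetical helper: derive the output name the same way extractSeq() does.
sub clean_name {
    my ($file) = @_;
    my ($dir, $pre, $suf) = ($file =~ /(^.+\/|^)(.+)\.(.+$)/);
    return "${pre}_CleanSeq.$suf";
}

print trim_at_primer("TTTTAGATCGGG"), "\n";   # prints TTTTAGATC
print clean_name("data/seq_1.fa"), "\n";      # prints seq_1_CleanSeq.fa
```

Note that the directory part of the input path is captured but not reused, so
the cleaned file is written to the current working directory.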
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110
Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018