Dear all,

I am trying to speed up a very long procedure that I need to run on multiple files, and I thought that I could multithread the jobs, one per file, across multiple CPUs. For some reason that I don't really get, I only achieve a very small time gain. I have included my script, which essentially repeats the same function, extractSeq(), on multiple files using a maximum of four threads.
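To make the threading pattern easier to see on its own, here is a minimal sketch of the batching logic, assuming a perl built with thread support. The names process_file() and run_in_batches() are hypothetical and only for illustration; process_file() is a trivial stand-in for extractSeq() and does no sequence work at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;

# Hypothetical stand-in for extractSeq(): just returns the length of the
# file name so that the sketch runs without BioPerl or any input files.
sub process_file {
    my ($file) = @_;
    return length $file;
}

# Run process_file() on every file, with at most $max threads at a time.
# Every thread of a batch is joined before the next batch is started.
sub run_in_batches {
    my ($max, @files) = @_;
    my @results;
    while (my @batch = splice(@files, 0, $max)) {
        my @thr = map { threads->create(\&process_file, $_) } @batch;
        push @results, map { $_->join } @thr;
    }
    return @results;
}

# Six placeholder file names, processed four at a time, then two.
my @names   = qw(seq_1.fa seq_2.fa seq_3.fa seq_4.fa seq_5.fa seq_6.fa);
my @lengths = run_in_batches(4, @names);
print "Processed ", scalar(@lengths), " files\n";
```

With this pattern a new batch only starts once the slowest thread of the previous batch has finished, so the files are handled in groups rather than continuously.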
I would really appreciate it if I could finally understand how to use threads to speed up some of my lengthy scripts.

Thanks

Marco

#!/usr/local/bin/perl -w

use strict;
use Bio::SeqIO;
use threads;
use Getopt::Std;

our $opt_p;
init();

# Start one thread per input file, in batches of at most $opt_p threads.
my @thr;
for (my $i = 0; $i <= $#ARGV; $i++) {
    push @thr, threads->new(\&extractSeq, $ARGV[$i]);
    # Once the batch is full (or the last file is reached), wait for
    # every thread in the batch before starting the next batch.
    if (scalar(@thr) == $opt_p || $i == $#ARGV) {
        print "Running ", scalar(@thr), " parallel jobs\n";
        $_->join for @thr;
        undef @thr;
    }
}

sub extractSeq {
    my $file = shift;
    # Split the path into directory, base name and extension.
    my ($dir, $pre, $suf) = ($file =~ /(^.+\/|^)(.+)\.(.+$)/);
    my $out_name = "$pre" . "_CleanSeq.$suf";
    my $seqin   = Bio::SeqIO->new(-file => $file, -format => 'fasta');
    my $seq_out = Bio::SeqIO->new(-file => ">$out_name", -format => 'fasta');
    while (my $seq = $seqin->next_seq) {
        if ($seq->seq =~ /AGATC/) {
            # $-[0] is the offset where the match starts; keep everything
            # up to and including the 5 nt of the B primer.
            $seq->seq($seq->subseq(1, $-[0] + 5));
            $seq_out->write_seq($seq);
        }
    }
    return(0);
}

sub init {
    getopts("p:");
    unless (@ARGV) {
        print("extractseq.pl [-p 4] seq_1.fa [seq_2.fa ...]\n\n",
              "Take the sequences from the Solexa sequences in Fasta format and\n",
              "\t1)Find the B primer\n",
              "\t2)Extract the sequences before the B primer leaving 5 nt of B primer\n\n",
              "-p\tNumber of processors to be used to process the files when more than one files are passed to the command line\n",
              "\tDefault 4\n\n");
        exit(1);
    }
    $opt_p = 4 unless $opt_p;
    return(0);
}

--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110
Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018