Hi all,
I am trying to filter files from a directory (code provided below) by
comparing the contents of each file with a hash ref (a parsed id map
file provided as an argument). The code works, but it is extremely
slow. The .csv files (81 files) that I am reading are not very large
(the largest file is 183,258 bytes). I would appreciate it if you could
suggest improvements to the code.

sub filter {
    my ( $pazar_dir_path, $up_map, $output ) = @_;
    croak "Not enough arguments! " if ( @_ < 3 );

    my $accepted = 0;
    my $rejected = 0;

    opendir DH, $pazar_dir_path
        or croak ("Error in opening directory '$pazar_dir_path': $!");
    open my $OUT, '>', $output
        or croak ("Cannot open file for writing '$output': $!");

    while ( my @data_files = grep(/\.csv$/,readdir(DH)) ) {
        my @records;
        foreach my $file ( @data_files ) {
            open my $FH, '<', "$pazar_dir_path/$file"
                or croak ("Cannot open file '$file': $!");
            while ( my $data = <$FH> ) {
                chomp $data;
                my $record_output;
                @records = split /\t/, $data;
                foreach my $up_acs ( keys %{$up_map} ) {
                    foreach my $ensemble_id (
                        @{$up_map->{$up_acs}{'Ensembl_TRS'}} ){
                        if ( $records[1] eq $ensemble_id ) {
                            $record_output = join( "\t", @records );
                            print $OUT "$record_output\n";
                            $accepted++;
                        }
                        else {
                            $rejected++;
                            next;
                        }
                    }
                }
            }
            close $FH;
        }
    }

    close $OUT;
    closedir (DH);
    print "accepted records: $accepted\n, rejected records: $rejected\n";
    return $output;
}

__DATA__
TF0000210 ENSMUST00000001326 SP1_MOUSE GS0000422 ENSMUSG00000037974 7 148974877 149005136 Mus musculus MUC5AC 14570593 ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
TF0000211 ENSMUST00000066003 SP3_MOUSE GS0000422 ENSMUSG00000037974 7 148974877 149005136 Mus musculus MUC5AC 14570593 ELECTROPHORETIC MOBILITY SHIFT ASSAY (EMSA)::SUPERSHIFT
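
One possible way to avoid the two inner loops, offered as a minimal
sketch rather than a tested replacement: collect every Ensembl_TRS ID
from the map into a single hash up front, so that each record needs one
hash lookup instead of a scan over the whole map. The helper names
build_id_lookup and filter_lines and the accession keys in the sample
map are hypothetical; the map layout is assumed to match $up_map above.

```perl
use strict;
use warnings;

# Build a set of every Ensembl transcript ID found in the map, once.
# Assumed layout: { accession => { Ensembl_TRS => [ id, ... ] } }
sub build_id_lookup {
    my ($up_map) = @_;
    my %wanted;
    for my $entry ( values %{$up_map} ) {
        # Entries without an Ensembl_TRS list contribute nothing.
        $wanted{$_} = 1 for @{ $entry->{'Ensembl_TRS'} || [] };
    }
    return \%wanted;
}

# Keep only tab-separated records whose second field is in the set.
sub filter_lines {
    my ( $lines, $wanted ) = @_;
    my @kept;
    for my $line ( @{$lines} ) {
        my $id = ( split /\t/, $line )[1];
        push @kept, $line if defined $id && exists $wanted->{$id};
    }
    return \@kept;
}

# Hypothetical map and shortened records, for illustration only.
my $up_map = {
    ACC_A => { Ensembl_TRS => ['ENSMUST00000001326'] },
    ACC_B => { Ensembl_TRS => [] },
};
my @lines = (
    "TF0000210\tENSMUST00000001326\tSP1_MOUSE",
    "TF0000211\tENSMUST00000066003\tSP3_MOUSE",
);

my $wanted = build_id_lookup($up_map);
my $kept   = filter_lines( \@lines, $wanted );
print "$_\n" for @{$kept};
```

Note that this changes the counting: a record is kept at most once,
whereas the nested loops above print it once per matching ID and
increment $rejected once per non-matching comparison.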
Thanks a lot,
Aravind