Re: How to speed up processing two big files

John W. Krahn Mon, 12 Jul 2004 12:21:03 -0700

On Monday 12 July 2004 10:59, Tang, Hannah (NIH/NLM) wrote:
>
> Hi,

Hello,


>    I have two big text files. I need to read one line from the first
> file, write some information from this line to a new file, search the
> second file for lines with the same control_id, and write more
> information to the new file. I wrote it in Perl, but it took half a
> day to finish joining the two files. Do you have any suggestions?
>
>    Below is some of my code.
> ====================================================================
>
> #!/usr/bin/perl -w
> #use IO::FILE;
> #use strict 'subs';
> #
>
> $file1="file1.txt";
> $file2="file2.txt";
>
> open (SOURCE, "$file1")
>         or die "can't open the $file1: $!";
>
> while (<SOURCE>) {
>         $control_id = substr($_, 0, 22);
>
>         open (SINK, ">>newFile.dat")
>                 or die "can't open the newFile.dat: $!";

You don't need to open this file inside the loop.  Open it once before
the loop starts.
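
A minimal sketch of that change, assuming lexical filehandles and a
hypothetical helper named copy_ids (neither is in your script). It
only copies the 22-character control id, and the output file is opened
exactly once:

```perl
use strict;
use warnings;

# Hypothetical helper: append the 22-character control id of every line
# in $in_name to $out_name.  The output file is opened once, before the
# loop, instead of once per input line.
sub copy_ids {
    my ( $in_name, $out_name ) = @_;

    open my $in,  '<',  $in_name  or die "can't open $in_name: $!";
    open my $out, '>>', $out_name or die "can't open $out_name: $!";

    while ( my $line = <$in> ) {
        print {$out} substr $line, 0, 22;
    }

    close $out or die "can't close $out_name: $!";
    close $in  or die "can't close $in_name: $!";
    return;
}
```

Something like copy_ids( 'file1.txt', 'newFile.dat' ) would then
replace the open/close pair inside your loop.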


>         print SINK $control_id;
>         #write more to newFile.dat
>
>         open (ADDSOURCE, "$file2")
>                 or die "can't open the $file2: $!";
>
>         while (<ADDSOURCE>) {
>                 if ($_ =~ /^$control_id/) {
>                         print SINK substr($_, 31, 3);
>                         #write more to newFile.dat
>                         $weight = substr($_, 48, 7);
>                         $totalWeight += $weight;
>                         $_ = <ADDSOURCE>;
>                         while ($_ =~ /^$control_id/) {
>                                 print SINK substr($_, 31, 3);
>                                 #write more to newFile.dat
>                                 $weight = substr($_, 48, 7);
>                                 $totalWeight += $weight;
>                                 $_ = <ADDSOURCE>;
>                                 }#end of while
>                         print SINK "$totalWeight";
>                         seek(ADDSOURCE, 0, 2)
>                                 or die "Couldn't seek to the end: $!\n";
>
>                         }#end of if
>                 }#end of while for ADDSOURCE

You are doing way too much inside the while loop.  This may not speed
up your program but it will make it a lot easier to read.  :-)

        while ( <ADDSOURCE> ) {
                next unless /^$control_id/;
                print SINK substr $_, 31, 3;
                #write more to newFile.dat
                $totalWeight += substr $_, 48, 7;
                }
        print SINK $totalWeight;
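
Your original loop stopped reading once the run of matching lines
ended, and the short version above reads the whole file every time.
Here is a rough sketch that keeps the early exit. It assumes the lines
for one control id sit next to each other in the second file, as your
inner loop implies. The helper name sum_block is made up for
illustration. It uses index() rather than a regex so that any regex
metacharacters in the id are matched literally:

```perl
use strict;
use warnings;

# Hypothetical helper: sum the weight column (offset 48, width 7) over
# the first contiguous run of lines that start with $control_id, then
# stop reading.
sub sum_block {
    my ( $fh, $control_id ) = @_;
    my ( $total, $seen ) = ( 0, 0 );

    while ( my $line = <$fh> ) {
        if ( index( $line, $control_id ) == 0 ) {   # line starts with the id
            $seen = 1;
            $total += substr $line, 48, 7;
        }
        elsif ( $seen ) {
            last;                                   # past the matching run
        }
    }
    return $total;
}
```

This still rescans the second file for every id, so it only trims the
work per scan.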


>         close(ADDSOURCE) or die "can't close $ADDSOURCE: $!\n";
>         close(SINK) or die "can't close $SINK: $!\n";
>         } #end of while for SOURCE
>   close(SOURCE) or die "can't close $SOURCE: $!\n";


Can you fit all of the control ids from "file1.txt" into an array or
hash in memory?  Perhaps a tied hash will help.

#!/usr/bin/perl -w
use strict;
#  UNTESTED !!

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';

open SOURCE, $file1 or die "can't open the $file1: $!";
my ( $order, %control_ids );
while ( <SOURCE> ) {
    $control_ids{ substr $_, 0, 22 } = {
        order  => ++$order,
        field  => [],       # don't know what to call this?
        weight => 0,
        };
    }
close SOURCE or die "can't close $file1: $!\n";

open ADDSOURCE, $file2 or die "can't open the $file2: $!";
while ( <ADDSOURCE> ) {
    my $id = substr $_, 0, 22;
    next unless exists $control_ids{ $id };
    push @{ $control_ids{ $id }{ field } }, substr $_, 31, 3;
    $control_ids{ $id }{ weight } += substr $_, 48, 7;
    }
close ADDSOURCE or die "can't close $file2: $!\n";

open SINK, '>>newFile.dat' or die "can't open the newFile.dat: $!";
for my $id ( sort { $control_ids{ $a }{ order } <=> $control_ids{ $b }{ order } }
             keys %control_ids ) {
    print SINK $id, @{ $control_ids{ $id }{ field } }, $control_ids{ $id }{ weight };
    }
close SINK or die "can't close newFile.dat: $!\n";

__END__
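
If the ids do not fit in memory, the tied hash idea might look roughly
like the sketch below. It is simplified to keep only the weight total
per id, because a plain DBM file stores flat strings and not the nested
structure used above. SDBM_File is my choice here only because it ships
with Perl, and it limits each key/value pair to about 1 KB. The helper
name add_weights and the file names are placeholders:

```perl
use strict;
use warnings;
use Fcntl;       # O_RDWR, O_CREAT
use SDBM_File;   # core module

# Hypothetical helper: add the weight column of every line in $file2 to
# $totals->{id}.  $totals may be a plain hash or a tied, on-disk one.
sub add_weights {
    my ( $totals, $file2 ) = @_;

    open my $in, '<', $file2 or die "can't open $file2: $!";
    while ( my $line = <$in> ) {
        my $id = substr $line, 0, 22;
        $totals->{ $id } += substr $line, 48, 7;
    }
    close $in or die "can't close $file2: $!";
    return;
}

# Possible use (file names are placeholders):
#   tie my %totals, 'SDBM_File', 'totals', O_RDWR | O_CREAT, 0666
#       or die "can't tie totals: $!";
#   add_weights( \%totals, 'file2.txt' );
#   untie %totals;
```

Every lookup then goes to disk, so expect it to be slower than the
in-memory hash, though it should still take one pass over each file.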



John
-- 
use Perl;
program
fulfillment
