RE: techniques for handling large text files

Tom Kinzer Sun, 28 Dec 2003 23:24:20 -0800

Maybe try not checking the marker counter every single line.

Also, is the end of file check really necessary?  You could just let it drop
out naturally.
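
A rough sketch of both ideas together, not a tested replacement for your script: the header section gets its own loop, so the marker counter is never looked at again once the second marker has gone by, and the remaining loops simply end when the lines run out or stop matching. The names split_sections and second_field are made up for illustration. It assumes every type 1 line comes before every type 2 line, as the quoted message describes, and it drops the script's "saw something other than a 1 or 2 line" check.

```perl
use strict;
use warnings;

# Second whitespace-separated field of a line, or '' if there is none.
sub second_field {
    my ($line) = @_;
    return $line =~ /^\S*\s+(\S+)/ ? $1 : '';
}

# Hypothetical helper: takes three already-open filehandles.
sub split_sections {
    my ($in, $out1, $out2) = @_;

    # Phase 1: skip the headers. The marker test only runs here.
    my $markers = 0;
    while (defined(my $header = <$in>)) {
        last if $header =~ /^marker/ && ++$markers == 2;
    }

    # Phase 2: copy type 1 lines until the first line that is not type 1.
    my $line;
    while (defined($line = <$in>)) {
        last if second_field($line) ne '1';
        print {$out1} $line;
    }

    # Phase 3: copy type 2 lines, starting with the line that ended phase 2.
    # The loop ends at the end-of-file marker line or at the real end of file.
    while (defined($line) && second_field($line) eq '2') {
        print {$out2} $line;
        $line = <$in>;
    }
    return;
}
```

Lines keep their trailing newline here, so they are printed as read rather than as "$line\n".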


These are probably minor hits compared to what a major I/O tweak from
someone else might save, but they are a suggestion.  I don't see anything
jumping out at me for I/O.

Have you tried running Benchmark on it?
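
For instance, the core Benchmark module can time two ways of pulling out the second field against each other. This is only a sketch: the sample line is invented, the sub names are hypothetical, and which candidate wins on your data is something to measure rather than assume.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample line; real input lines will look different.
my $sample = "example.com. 1 some more fields here\n";

# Candidate A: split the whole line, as the posted script does.
sub field_by_split {
    my @fields = split /\s+/, $_[0];
    return $fields[1];
}

# Candidate B: capture only the second field with a regex.
sub field_by_regex {
    return $_[0] =~ /^\S*\s+(\S+)/ ? $1 : undef;
}

# Run each candidate 200_000 times and print a comparison table.
cmpthese(200_000, {
    split => sub { field_by_split($sample) },
    regex => sub { field_by_regex($sample) },
});
```

Timing the whole script on a smaller slice of the real file is still the more telling number, since this only measures one step in isolation.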

-Tom Kinzer

-----Original Message-----
From: danl001 [mailto:[EMAIL PROTECTED]
Sent: Sunday, December 28, 2003 10:06 PM
To: [EMAIL PROTECTED]
Subject: techniques for handling large text files


Hi,

If this question would be better posted to another perl list, please let
me know.

I have a very large text file (~2 GB) in the following format:

header line
header line
header line
marker 1
header line
header line
header line
marker 2
line type 1
line type 1
line type 1
...
line type 1
line type 2
line type 2
line type 2
...
line type 2
end of file marker line


My objective is to put all "line type 1" lines to file1.txt and all
"line type 2" lines to file2.txt. The "header line" and any of the
marker lines will not appear in either file1.txt or file2.txt. Note
there is no marker line between where line type 1 ends and where line
type 2 starts, but that can be determined by examining a field in the line.

So I have a script to do this. Essentially, it visits each line in the
file and decides which output file to write it to. The problem is that it
takes a long time to run (roughly 45 min on a dual P4 with 512 MB of RAM).
I'd like to cut this running time down as much as possible. What I'm
looking for is either suggestions on a better way to do this in Perl, or
techniques I could use to speed up my current script. I have pasted the
relevant parts of the script below. I noticed I could shave a bit off
the runtime by reading the original file in a buffered manner instead of
line by line. My outputs to file1.txt and file2.txt at this point take
place with prints to their respective file handles.

Any suggestions that will speed this up in any way will be greatly
appreciated! Thanks,

Dan

--- script ---

# open original file
open(INPUT, $filename) or die "error: $filename cannot be opened\n";
my $BUFFER_SIZE = 4096;
my $buffer = "";
my $sz_buffer = 0;

# open output file for line type 1
my $out1 = "$file1.txt";
open(OUT1, ">$out1") or die "error: $out1 cannot be opened for writing\n";

# open output file for line type 2
my $out2 = "$file2.txt";
open(OUT2, ">$out2") or die "error: $out2 cannot be opened for writing\n";

# counter for the markers we see
my $marker_count = 0;

my $regex_split_space='\s+';
my $regex_split_newline='\n';
my $regex_marker='^marker';
my $regex_eof='^end file';

while (my $rv = read(INPUT, $buffer, $BUFFER_SIZE)) {

     if ($rv >= $BUFFER_SIZE) {
         $buffer .= <INPUT>;
     }

     #print "rv: $rv\n";
     my @lines = split(/$regex_split_newline/o, $buffer);
     # process each line in zone file
     foreach my $line (@lines) {

         #print "line: $line\n";
         if ( $marker_count != 2 ) {

             # if we haven't seen 2 marker lines, we
             # are still in the header section of the file
             if ( $line =~ m/$regex_marker/o ) {
                 $marker_count++;
             }

         } elsif ( $line =~ m/$regex_eof/o ) {
             # end of the input file. close
             # our two output files and get out
             close(OUT1);
             close(OUT2);
             exit 0;

         } else {
             # a line we care about

             # split the line on a space character
             my @fields = split(/$regex_split_space/o, $line);

             # check the second field in this line
             if ( $fields[1] eq "1" ) {

                 print OUT1 "$line\n";

             } elsif ( $fields[1] eq "2" ) {

                 print OUT2 "$line\n";

             } else {

                 print "@fields\n";
                 die "saw something other than a 1 or 2 line\n";

             }
         }
     }

     $buffer = "";
}


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>