Maybe try not checking the marker counter every single line. Also, is the end of file check really necessary? You could just let it drop out naturally.
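A minimal sketch of both suggestions, not a drop-in replacement for the posted script: one loop consumes the header, so the marker counter is never tested once the data section starts, and the end-marker test moves to the rarely reached fallback branch. The helper name split_by_type, the /^end/ pattern, and the sample data are assumptions made for illustration; the real marker text is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: copies type-1 lines to $out1 and type-2 lines to $out2.
# $in, $out1 and $out2 are already-open filehandles.
sub split_by_type {
    my ($in, $out1, $out2) = @_;

    # Phase 1: skip the header. The marker counter lives only in this loop.
    my $markers = 0;
    while ($markers < 2) {
        my $line = <$in>;
        die "input ended before the second marker\n" unless defined $line;
        $markers++ if $line =~ /^marker/;
    }

    # Phase 2: data lines only, so there is no marker test on the hot path.
    while (my $line = <$in>) {
        # A limit of 3 lets split stop right after the field we need.
        my (undef, $type) = split /\s+/, $line, 3;
        $type = '' unless defined $type;

        if    ($type eq '1')    { print {$out1} $line }
        elsif ($type eq '2')    { print {$out2} $line }
        elsif ($line =~ /^end/) { last }    # assumed end-marker pattern
        else  { die "saw something other than a 1 or 2 line: $line" }
    }
    return;
}

# Small in-memory demonstration; real use would open the large file instead.
my $sample = join '', map { "$_\n" }
    'header', 'marker 1', 'header', 'marker 2',
    'a 1 x', 'b 1 y', 'c 2 z', 'end of file marker line';
my ($type1, $type2) = ('', '');
open my $in, '<', \$sample or die "cannot read sample: $!\n";
open my $o1, '>', \$type1  or die "cannot open buffer: $!\n";
open my $o2, '>', \$type2  or die "cannot open buffer: $!\n";
split_by_type($in, $o1, $o2);
close $_ for $in, $o1, $o2;
print "type 1:\n$type1", "type 2:\n$type2";
```

Whether this is measurably faster on a 2 GB file is something only a timing run on the real data can tell; the sketch also relies on Perl's own line buffering rather than the manual read() loop.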
These are probably minor gains compared to a major I/O tweak someone else may come up with, but they are a suggestion. Nothing jumps out at me on the I/O side. Have you tried benchmarking it?

-Tom Kinzer

-----Original Message-----
From: danl001 [mailto:[EMAIL PROTECTED]
Sent: Sunday, December 28, 2003 10:06 PM
To: [EMAIL PROTECTED]
Subject: techniques for handling large text files

Hi,

If this question would be better posted to another perl list, please let me know.

I have a very large text file (~2 GB) in the following format:

    header line
    header line
    header line
    marker 1
    header line
    header line
    header line
    marker 2
    line type 1
    line type 1
    line type 1
    ...
    line type 1
    line type 2
    line type 2
    line type 2
    ...
    line type 2
    end of file marker line

My objective is to write all "line type 1" lines to file1.txt and all "line type 2" lines to file2.txt. The "header line" lines and the marker lines should not appear in either file1.txt or file2.txt. Note that there is no marker line between where line type 1 ends and where line type 2 starts, but the boundary can be determined by examining a field in the line.

So I have a script to do this. Essentially, it visits each line in the file and decides which output file to write it to. The problem is that it takes a long time to run (roughly 45 min on a dual P4 with 512 MB RAM). I'd like to cut this running time down as much as possible. What I'm looking for is either suggestions on a better way to do this in perl, or techniques I could use to speed up my current script.

I have pasted the relevant parts of the script below. I noticed I could shave a bit off the runtime by reading the original file in a buffered manner instead of line by line. My outputs to file1.txt and file2.txt at this point are plain prints to their respective file handles. Any suggestions that will speed this up in any way will be greatly appreciated!
Thanks,
Dan

--- script ---

# open original file
open(INPUT, $filename) or die "error: $filename cannot be opened\n";
my $BUFFER_SIZE = 4096;
my $buffer = "";
my $sz_buffer = 0;

# open output file for line type 1
my $out1 = "$file1.txt";
open(OUT1, ">$out1") or die "error: $out1 cannot be opened for writing\n";

# open output file for line type 2
my $out2 = "$file2.txt";
open(OUT2, ">$out2") or die "error: $out2 cannot be opened for writing\n";

# counter for the markers we see
my $marker_count = 0;

my $regex_split_space='\s+';
my $regex_split_newline='\n';
my $regex_marker='^marker';
my $regex_eof='^end file';

while (my $rv = read(INPUT, $buffer, $BUFFER_SIZE)) {
    if ($rv >= $BUFFER_SIZE) {
        $buffer .= <INPUT>;
    }
    #print "rv: $rv\n";
    my @lines = split(/$regex_split_newline/o, $buffer);

    # process each line in zone file
    foreach my $line (@lines) {
        #print "line: $line\n";
        if ( $marker_count != 2 ) {
            # if we haven't seen 2 marker lines, we
            # are still in the header section of the file
            if ( $line =~ m/$regex_marker/o ) {
                $marker_count++;
            }
        }
        elsif ( $line =~ m/$regex_eof/o ) {
            # end of the input file. close
            # our two output files and get out
            close(OUT1);
            close(OUT2);
            exit 0;
        }
        else {
            # a line we care about
            # split the line on a space character
            my @fields = split(/$regex_split_space/o, $line);

            # check the second field in this line
            if ( $fields[1] eq "1" ) {
                print OUT1 "$line\n";
            }
            elsif ( $fields[1] eq "2" ) {
                print OUT2 "$line\n";
            }
            else {
                print "@fields\n";
                die "saw something other than a 1 or 2 line\n";
            }
        }
    }
    $buffer = "";
}

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>
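On the benchmarking question raised in the reply: the script splits every data line into all of its fields but only ever looks at the second one, so that step is a natural candidate to time. The sketch below uses the core Benchmark module to compare three ways of extracting that field. The sample line is made up, since the real record layout is not shown in the thread, and the relative results on real data may well differ.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample line: a name, the type field, then 20 filler fields.
my $line = join(' ', 'name', '1', ('field') x 20) . "\n";

# Three ways to get the second whitespace-separated field.
my %extract = (
    # Split the whole line, as the posted script does.
    full_split    => sub { my @f = split /\s+/, $line; $f[1] },
    # Stop splitting once the second field has been isolated.
    limited_split => sub { my (undef, $t) = split /\s+/, $line, 3; $t },
    # Capture the second field directly with a regex.
    regex         => sub { $line =~ /^\S*\s+(\S+)/ ? $1 : undef },
);

# Confirm the candidates agree before timing them.
for my $name (sort keys %extract) {
    my $got = $extract{$name}->();
    die "$name returned an unexpected value\n"
        unless defined $got && $got eq '1';
}

# A negative count asks Benchmark to run each candidate for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, \%extract);
```

The same pattern could be pointed at the read strategy (manual read() buffering versus a plain line-by-line loop) by wrapping each variant in its own sub and timing them on a representative slice of the file.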