Cheez wrote:
> Howdy, scripting with perl is a hobby and not a vocation so i
> apologize in advance for rough looking code.
> 
> I have a very large list of 16-letter words called
> "hashsequence16.txt".  This file is 203MB in size.
> 
> I have a large list of data called "newrawdata.txt".  This file is
> 95MB.
> 
> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the the full line of rawdata.txt and
> associate that with the 16-letter word.
> 
> Using a filesize line-counter and timing how long it takes to process
> my data lets me know that I have 9534 hours to see if I can find an
> alternative solution.  It's pretty brute force but I don't know if
> there is another way to do it.
> 
> Any comments or guidance would be greatly appreciated.
> 
> Thanks,
> Dan
> ==========================================
> 
> print "**fisher**";
> 
> $flatfile = "newrawdata.txt";
> # 95MB in size
> 
> $datafile = "hashsequence16.txt";
> # 203MB in size
> 
> my $filesize = -s "hashsequence16.txt";
> # for use in processing time calculation
> 
> open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
> open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
> open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!
> \n";
> 
> @preparse = <FILE>;
> @hashdata = <FILE2>;
> 
> close(FILE);
> close(FILE2);
> 
> 
> for my $list1 (@hashdata) {
> # iterating through hash16 data
> 
>     $finish++;
> 
>     if ($finish ==10 ) {
> # line counter
> 
>       $marker = $marker + $finish;
> 
>       $finish =0;
> 
>       $left = $filesize - $marker;
> 
>       printf "$left\/$filesize\n";
> # this prints every 17 seconds
>                       }
> 
>     ($line, $freq) = split(/\t/, $list1);
> 
>     for my $rawdata (@preparse) {
> # iterating through rawdata
> 
>       $rawdata=~ s/\n//;
> 
>       if ($rawdata =~ m/$line/) {
> # matching hash16 word with rawdata line
> 
>           my $first_pos = index  $rawdata,$line;
> 
>           print SEQFILE "$first_pos\t$rawdata\n";
> # printing to info to new file
> 
>                               }
> 
>                       }
> 
>     print SEQFILE "PROCESS\t$line\n";
> # printing hash16 word and "process"
> 
> }

First of all, it would help a lot if you added

  use strict;
  use warnings;

to the top of your program and declared everything with 'my'. You've declared a
few variables, such as $first_pos and $filesize, with 'my' but used package
variables for everything else. You've also used $seqparsed in your open error
message, but it's never assigned anywhere. The strict pragma would have picked
that up.
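
As a rough illustration of that last point, here is a minimal sketch. The
snippet and the $outfile and $message names are made up for the demonstration;
it compiles the offending line inside a string eval so the error can be printed
instead of stopping the script.

```perl
use strict;
use warnings;

# Compile a tiny snippet at run time so the strict error can be shown
# without stopping this script. $seqparsed is deliberately undeclared.
my $snippet = <<'END';
use strict;
my $outfile = 'fishersearch.txt';
my $message = "Can't open '$seqparsed'";
END

# eval returns undef and sets $@ when the snippet fails to compile
my $ok    = eval "$snippet; 1";
my $error = $ok ? '' : $@;

print "strict refused to compile it:\n$error" if $error;
```

In a normal script the same complaint appears at compile time, before any of
your data is touched.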

There's a problem with your progress arithmetic. You're subtracting the number
of lines read from the total file size in bytes, which doesn't make sense
because the two values are in different units.
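
One way to keep the units consistent is to count lines on both sides. This is
only a sketch: progress_message is a hypothetical helper, and the forty
stand-in records take the place of your real data.

```perl
use strict;
use warnings;

# Hypothetical helper: both arguments are line counts, so the
# subtraction compares like with like.
sub progress_message {
  my ($done, $total) = @_;
  my $left = $total - $done;
  return sprintf "%d/%d lines left (%.1f%% done)",
    $left, $total, 100 * $done / $total;
}

my @hashdata = map { "word$_" } 1 .. 40;   # stand-in for the real records
my $total    = @hashdata;                  # number of lines, not bytes
my $done     = 0;

for my $line (@hashdata) {
  $done++;
  # Report every tenth record, as the original script does
  print progress_message($done, $total), "\n" if $done % 10 == 0;
}
```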

The best way to speed things up is to avoid repeating operations wherever
possible. First of all, you're splitting the records in the hash data file over
and over again. If you're only interested in the first field (before the tab)
then discard the rest before you enter the loop. The same applies to the records
from the other file: you're trimming the newline from each one on every pass
through the outer loop. Do this once before the loop.

It's also wasteful to locate the substring with a regex and then locate it
again with index(). Using the position from the regex match will potentially
halve your execution time. There's also some mileage to be gained from only
compiling the regex once before you match it against each of the data lines.
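
As a small sketch of both ideas, the @- array holds the start offset of the
last successful match, so $-[0] gives the same number index() would. The
first_offset helper and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset of the first match, or -1 if
# there is none, mirroring what index() returns.
sub first_offset {
  my ($rawdata, $re) = @_;
  return $rawdata =~ $re ? $-[0] : -1;
}

my $word = 'ACGT';          # made-up search word
my $re   = qr/\Q$word\E/;   # compiled once, reused for every line

for my $rawdata ('TTACGTAA', 'GGGG') {
  my $first_pos = first_offset($rawdata, $re);
  print "$first_pos\t$rawdata\n" if $first_pos >= 0;
}
```

Since the words are plain strings, calling index() on its own and testing for
-1 would do the same job without a regex at all.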

The program below should be a lot faster. Whether it is fast enough for you is
another matter, but please come back to us if you need more help. Beware that
it's been tested only superficially, so it may need more work.

HTH,

Rob



use strict;
use warnings;

my $flatfile = 'newrawdata.txt';
my $datafile = 'hashsequence16.txt';
my $seqparsed = 'fishersearch.txt';

my @preparse = do {
  open my $fh, '<', $flatfile or die "Can't open '$flatfile': $!";
  <$fh>;
};
chomp @preparse;

my @hashdata;
{
  open my $fh, '<', $datafile or die "Can't open '$datafile': $!";
  while (<$fh>) {
    my ($line) = split;
    push @hashdata, $line;
  }
}

open my $seqfile, '>', $seqparsed or die "Can't open '$seqparsed': $!";

foreach my $line (@hashdata) {

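  # Compile the literal pattern once per word
  my $re = qr/\Q$line\E/;

  foreach my $rawdata (@preparse) {

    # No /g here: it would leave pos($rawdata) set after a match, so the
    # search for the next word would start part-way through the line
    if ($rawdata =~ $re) {
      my $first_pos = $-[0];  # offset where the match starts
      print $seqfile "$first_pos\t$rawdata\n";
    }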
  }

  print $seqfile "PROCESS\t$line\n";
}
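
If that still isn't fast enough, a different technique may be worth trying.
This is not what the program above does: it is a sketch that assumes every
search word has the same fixed length (16 in your case). It puts the words in
a hash and slides a window of that length along each data line, so each line
is scanned once rather than once per word. The find_fixed_length_words name
and the short sample data are made up.

```perl
use strict;
use warnings;

# Assumption: all search words are exactly $len characters long.
# Returns a hash ref mapping each word to a list of "offset<TAB>line" hits.
sub find_fixed_length_words {
  my ($words, $lines, $len) = @_;
  my %hits = map { $_ => [] } @$words;

  for my $rawdata (@$lines) {
    my %seen;   # keep only the first offset of each word in this line
    for my $pos (0 .. length($rawdata) - $len) {
      my $chunk = substr $rawdata, $pos, $len;
      next unless exists $hits{$chunk};   # window is not a search word
      next if $seen{$chunk}++;
      push @{ $hits{$chunk} }, "$pos\t$rawdata";
    }
  }
  return \%hits;
}

my @words = ('ACGT', 'GGCC');                    # 4 letters for brevity
my @lines = ('TTACGTAA', 'GGCCACGT', 'TTTT');
my $hits  = find_fixed_length_words(\@words, \@lines, 4);

for my $word (@words) {
  print "$_\n" for @{ $hits->{$word} };
  print "PROCESS\t$word\n";
}
```

The trade-off is memory: every hit is held until the end so that the output
can still be grouped by word, which may be too much for files of your size.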
