On Jun 30, 8:01 am, [EMAIL PROTECTED] (Cheez) wrote:
> Howdy, scripting with perl is a hobby and not a vocation so i
> apologize in advance for rough looking code.
>
> I have a very large list of 16-letter words called
> "hashsequence16.txt".  This file is 203MB in size.
>
> I have a large list of data called "newrawdata.txt".  This file is
> 95MB.
>
> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the full line of rawdata.txt and
> associate that with the 16-letter word.
>
> Using a filesize line-counter and timing how long it takes to process
> my data lets me know that I have 9534 hours to see if I can find an
> alternative solution.  It's pretty brute force but I don't know if
> there is another way to do it.
>
> Any comments or guidance would be greatly appreciated.
>
> Thanks,
> Dan
> ==========================================
>
> print "**fisher**";
>
> $flatfile = "newrawdata.txt";
> # 95MB in size
>
> $datafile = "hashsequence16.txt";
> # 203MB in size
>
> my $filesize = -s "hashsequence16.txt";
> # for use in processing time calculation
>
> open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
> open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
> open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!
> \n";
>
> @preparse = <FILE>;
> @hashdata = <FILE2>;
>
> close(FILE);
> close(FILE2);
>
> for my $list1 (@hashdata) {
> # iterating through hash16 data
>
>     $finish++;
>
>     if ($finish ==10 ) {
> # line counter
>
>         $marker = $marker + $finish;
>
>         $finish =0;
>
>         $left = $filesize - $marker;
>
>         printf "$left\/$filesize\n";
> # this prints every 17 seconds
>                         }
>
>     ($line, $freq) = split(/\t/, $list1);
>
>     for my $rawdata (@preparse) {
> # iterating through rawdata
>
>         $rawdata=~ s/\n//;
>
>         if ($rawdata =~ m/$line/) {
> # matching hash16 word with rawdata line
>
>             my $first_pos = index  $rawdata,$line;
>
>             print SEQFILE "$first_pos\t$rawdata\n";
> # printing to info to new file
>
>                                 }
>
>                         }
>
>     print SEQFILE "PROCESS\t$line\n";
> # printing hash16 word and "process"
>
> }



Hi there, let me see if I can help you...

Always include these two. They help with debugging, among other things:

use strict;
use warnings;
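
As a small illustration of what those two pragmas catch, here is a minimal sketch. The quoted script interpolates $seqparsed in a die message without ever declaring or setting it; under strict that is a compile-time error. The variable $outfile and the string eval are only there so the demo can report both problems and keep running.

```perl
use strict;
use warnings;

# strict: an undeclared variable is a compile-time error.
# (String eval is used only so this demo can report the error and go on.)
my $compiled     = eval q{ $seqparsed = "out.txt"; 1 };
my $strict_error = $compiled ? '' : $@;
print "strict: $strict_error";

# warnings: using an undefined value is reported at run time.
# Collect the warning instead of letting it go to STDERR.
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

my $outfile;                              # declared but never assigned
my $message = "Can't open '$outfile'";    # warns: uninitialized value
print "warnings: $caught[0]";
```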


> @preparse = <FILE>;
> @hashdata = <FILE2>;
That may be part of why your program runs so slowly:
you are slurping both big files into arrays.

Try something like...

    my $temp_file = "temp.txt";
    open (my $temp_file_fh, "<", $temp_file) or die "Can't open '$temp_file': $!";

    while (<$temp_file_fh>){
        s/[\r\n]+\z//; # Remove trailing carriage returns and newlines
        if ($_ =~ m/<your_regex-here>/){
            print "found\n";
        }
    }
    close ($temp_file_fh);

See what I mean? Only slurp really, really small files, and even then think twice.
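
To make the line-by-line idea concrete, here is a minimal sketch. It assumes the word you are looking for is a plain literal string, so index can stand in for the regex (that also avoids surprises if a word ever contains a regex metacharacter). The sub name scan_for_word and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Read a filehandle one line at a time and return [position, line] for
# each line that contains $word as a literal substring.
sub scan_for_word {
    my ( $fh, $word ) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        $line =~ s/[\r\n]+\z//;            # strip the line ending
        my $pos = index $line, $word;      # -1 when $word is absent
        push @hits, [ $pos, $line ] if $pos >= 0;
    }
    return @hits;
}

# Demo on an in-memory "file"; a real run would open newrawdata.txt here.
open my $demo_fh, '<', \"first line\nhas needle here\n" or die $!;
print "$_->[0]\t$_->[1]\n" for scan_for_word( $demo_fh, 'needle' );
```

Only one line is held in memory at a time, so the size of the file being scanned no longer matters for memory use.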

> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the full line of rawdata.txt and
> associate that with the 16-letter word.

I'd find whatever I'm looking for in each file and push the values into
an array or write them to an external file.
Then I'd create a hash to associate the entries from both.
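
Applied to the original 16-letter-word problem, the hash idea could look like the sketch below. This is a swapped-in technique, not what the scripts further down do, and it rests on assumptions: every word is exactly 16 characters, words are literal strings, and the word list fits in memory as a hash. The names load_words and find_words are hypothetical. Instead of scanning the raw data once per word, it makes one pass over the raw data and looks up each 16-character window in the hash.

```perl
use strict;
use warnings;

my $WORD_LEN = 16;    # assumption: every word is exactly 16 characters

# Build a lookup table from lines of the form "word<TAB>freq".
sub load_words {
    my ($fh) = @_;
    my %words;
    while ( my $row = <$fh> ) {
        $row =~ s/[\r\n]+\z//;
        my ($word) = split /\t/, $row;
        $words{$word} = 1 if defined $word && length($word) == $WORD_LEN;
    }
    return \%words;
}

# Slide a 16-character window over one line and return [position, word]
# for every window that is a known word.
sub find_words {
    my ( $words, $line ) = @_;
    my @hits;
    for my $pos ( 0 .. length($line) - $WORD_LEN ) {
        my $window = substr $line, $pos, $WORD_LEN;
        push @hits, [ $pos, $window ] if exists $words->{$window};
    }
    return @hits;
}

# Demo on in-memory data; real code would open the two files instead.
open my $word_fh, '<', \"AAAACCCCGGGGTTTT\t3\nTTTTGGGGCCCCAAAA\t1\n" or die $!;
my $words = load_words($word_fh);
for my $line ( "xxAAAACCCCGGGGTTTTyy", "no match here" ) {
    print "$_->[0]\t$_->[1]\t$line\n" for find_words( $words, $line );
}
```

Note that the results come out grouped by raw-data line rather than by word, and every matching position is reported, not just the first one.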

Try the scripts below...

For example, file 1: we want to find apples. File 1 contains apples,
oranges, and bananas.
#!/usr/bin/perl
use strict;
use warnings;

my $ca_dir_path = "ca_files";
my $ca_log_path = "log_ca.txt";
my @ca_iea_values;
my %log_ca;


opendir (CADIR, $ca_dir_path) or die $!;

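while (defined (my $file = readdir (CADIR))){

    #skip . and .. files
    next if $file =~ m#^\.\.?$#;

    # open relative to the directory rather than chdir-ing into it, so
    # the log below is written where the script was started
    open (FILE, "<", "$ca_dir_path/$file") or die "Can't open '$ca_dir_path/$file': $!";
    while (<FILE>) {
        chomp;
        if ( m/^IEA\*/ ) {
            # /apples/ is a placeholder: use your own pattern with a
            # capture group so the captured text becomes the hash key
            push @ca_iea_values, /apples/;
            $log_ca{ pop @ca_iea_values } = $file;
        }
    }
    close (FILE);

}
closedir (CADIR);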

open (CA_LOG, ">$ca_log_path") or die $!;
foreach (sort { $a cmp $b } keys(%log_ca) ){
    print CA_LOG "$_->$log_ca{$_}\n";
}


File 2: also looking for apples. This one has apples, melons,
and berries.
#!/usr/bin/perl
use strict;
use warnings;


my $aa_dir_path = "aa_files";
my $aa_log_path = "log_aa.txt";

my @aa_iea_values;
my %log_aa;

opendir (AADIR, $aa_dir_path) or die $!;
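while (defined (my $file = readdir (AADIR))){

    #skip . and .. files
    next if $file =~ m#^\.\.?$#;

    # open relative to the directory rather than chdir-ing into it, so
    # the log below is written where the script was started
    open (FILE, "<", "$aa_dir_path/$file") or die "Can't open '$aa_dir_path/$file': $!";
    while (<FILE>) {
        chomp;
        if ( m/^IEA\{/ ) {
            # /apples/ is a placeholder: use your own pattern with a
            # capture group so the captured text becomes the hash key
            push @aa_iea_values, /apples/;
            $log_aa{ pop @aa_iea_values } = $file;
        }
    }
    close (FILE);

}
closedir (AADIR);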


open (AA_LOG, ">$aa_log_path") or die $!;
foreach (sort { $a cmp $b } keys(%log_aa) ){
    print AA_LOG "$_->$log_aa{$_}\n";
}


File 3 would be the actual "report" generator:

#!/usr/bin/perl
use warnings;
use strict;

my $ca_log_path = "log_ca.txt";
my $aa_log_path = "log_aa.txt";


my %final_report;
my @ca_filenames;
my @aa_filenames;


open (CAFILE, $ca_log_path) or die $!;
my @ca_files = <CAFILE>;

open(AAFILE, $aa_log_path) or die $!;
my @aa_files = <AAFILE>;


#sort arrays
my @ca_files_sorted = sort @ca_files;
my @aa_files_sorted = sort @aa_files;

my $total_items = @ca_files_sorted;

foreach(@ca_files_sorted){
     s/\s+\z//;  # Remove all trailing whitespace
     push @ca_filenames, /\d+->(.+)/;
}



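foreach(@aa_files_sorted){
     s/\s+\z//;  # Remove all trailing whitespace
     push @aa_filenames, /\d+->(.+)/;
}

# Pair the entries by position; this assumes both logs have the same
# number of matching lines, in the same sorted order
for (1..$total_items){
        $final_report{ pop @ca_filenames } = pop @aa_filenames;
}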



print "APPLES FILE 1 => APPLES FILE 2\n";
print '-' x 27, "\n";
foreach (sort { $a cmp $b } keys(%final_report) ){
    print "$_ => $final_report{$_}\n";
}
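
If pairing the two logs by position turns out to be too fragile, an alternative is to join them on the value before the "->". The sketch below assumes that value is the shared key in both logs, which the scripts above do not guarantee; parse_log and join_logs are hypothetical names and the sample lines are invented.

```perl
use strict;
use warnings;

# Turn "key->filename" lines into a hash of key => filename.
# Lines that do not have that shape are skipped.
sub parse_log {
    my %seen;
    for my $line (@_) {
        my $copy = $line;
        $copy =~ s/\s+\z//;    # remove all trailing whitespace
        my ( $key, $file ) = $copy =~ /^(.+?)->(.+)$/ or next;
        $seen{$key} = $file;
    }
    return \%seen;
}

# Return [key, ca_file, aa_file] for every key present in both logs.
sub join_logs {
    my ( $ca, $aa ) = @_;
    return map { [ $_, $ca->{$_}, $aa->{$_} ] }
           grep { exists $aa->{$_} } sort keys %$ca;
}

my $ca = parse_log( "apples->ca1.txt\n", "pears->ca2.txt\n" );
my $aa = parse_log( "apples->aa7.txt\n", "melons->aa9.txt\n" );
print "$_->[0]: $_->[1] => $_->[2]\n" for join_logs( $ca, $aa );
```

Keys that appear in only one log are simply left out of the report, so the two logs no longer need to have the same number of lines.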

Is this homework by the way dude?

Anyway, my two cents: run them. If they work right away, cool. If not,
that'll get you started. There's more than one way to do it.



