On Jun 30, 8:01 am, [EMAIL PROTECTED] (Cheez) wrote:
> Howdy, scripting with perl is a hobby and not a vocation so i
> apologize in advance for rough looking code.
>
> I have a very large list of 16-letter words called
> "hashsequence16.txt".  This file is 203MB in size.
>
> I have a large list of data called "newrawdata.txt".  This file is
> 95MB.
>
> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the full line of rawdata.txt and
> associate that with the 16-letter word.
>
> Using a filesize line-counter and timing how long it takes to process
> my data lets me know that I have 9534 hours to see if I can find an
> alternative solution.  It's pretty brute force but I don't know if
> there is another way to do it.
>
> Any comments or guidance would be greatly appreciated.
>
> Thanks,
> Dan
> ==========================================
>
> print "**fisher**";
>
> $flatfile = "newrawdata.txt";
> # 95MB in size
>
> $datafile = "hashsequence16.txt";
> # 203MB in size
>
> my $filesize = -s "hashsequence16.txt";
> # for use in processing time calculation
>
> open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
> open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
> open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!\n";
>
> @preparse = <FILE>;
> @hashdata = <FILE2>;
>
> close(FILE);
> close(FILE2);
>
> for my $list1 (@hashdata) {
>     # iterating through hash16 data
>
>     $finish++;
>
>     if ($finish ==10 ) {
>         # line counter
>
>         $marker = $marker + $finish;
>
>         $finish =0;
>
>         $left = $filesize - $marker;
>
>         printf "$left\/$filesize\n";
>         # this prints every 17 seconds
>     }
>
>     ($line, $freq) = split(/\t/, $list1);
>
>     for my $rawdata (@preparse) {
>         # iterating through rawdata
>
>         $rawdata=~ s/\n//;
>
>         if ($rawdata =~ m/$line/) {
>             # matching hash16 word with rawdata line
>
>             my $first_pos = index $rawdata,$line;
>
>             print SEQFILE "$first_pos\t$rawdata\n";
>             # printing to info to new file
>
>         }
>
>     }
>
>     print SEQFILE "PROCESS\t$line\n";
>     # printing hash16 word and "process"
>
> }

Hi there,

Let me see if I can help. First, always include these two lines; they
help a lot with debugging:

    use strict;
    use warnings;

> @preparse = <FILE>;
> @hashdata = <FILE2>;

Maybe that's why your program runs so slowly: you are slurping two big
files into arrays. Try reading one line at a time instead, something
like this:

    my $temp_file = "temp.txt";
    open(my $temp_file_fh, "<", $temp_file)
        or die "Can't open '$temp_file': $!";

    while (<$temp_file_fh>) {
        s/[\r\n]+\z//;    # remove trailing carriage returns and newlines
        if (m/<your_regex-here>/) {
            print "found\n";
        }
    }
    close($temp_file_fh);

See what I mean? Keep slurping for really, really small files, and even
then use it sparingly.

> For each 16-letter word, I am looping through "newrawdata.txt" to 1)
> find a match and 2) take the full line of rawdata.txt and
> associate that with the 16-letter word.

I'd find whatever I'm looking for in each file and push the values into
an array or an external file. Then I'd build a hash to associate the two
sets of entries. Try the scripts below.

For example, file 1: we want to find apples. Its input contains apples,
oranges, and bananas.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $ca_dir_path = "ca_files";
    my $ca_log_path = "log_ca.txt";
    my %log_ca;

    opendir(my $ca_dir, $ca_dir_path)
        or die "Can't open '$ca_dir_path': $!";
    while (defined(my $file = readdir($ca_dir))) {
        # skip ., .., and anything else that is not a plain file
        next unless -f "$ca_dir_path/$file";
        open(my $fh, "<", "$ca_dir_path/$file")
            or die "Can't open '$ca_dir_path/$file': $!";
        while (<$fh>) {
            chomp;
            # only lines that start with "IEA*" are of interest
            next unless m/^IEA\*/;
            # remember which file the value was found in; a later file
            # with the same value overwrites an earlier one
            if (my ($value) = /(apples)/) {
                $log_ca{$value} = $file;
            }
        }
        close($fh);
    }
    closedir($ca_dir);

    # write the log next to the script, one "value->filename" per line
    open(my $ca_log, ">", $ca_log_path)
        or die "Can't open '$ca_log_path': $!";
    foreach (sort keys %log_ca) {
        print $ca_log $_, "->", $log_ca{$_}, "\n";
    }
    close($ca_log);

File 2: also looking for apples. This one's input has apples, melons,
and berries.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $aa_dir_path = "aa_files";
    my $aa_log_path = "log_aa.txt";
    my %log_aa;

    opendir(my $aa_dir, $aa_dir_path)
        or die "Can't open '$aa_dir_path': $!";
    while (defined(my $file = readdir($aa_dir))) {
        # skip ., .., and anything else that is not a plain file
        next unless -f "$aa_dir_path/$file";
        open(my $fh, "<", "$aa_dir_path/$file")
            or die "Can't open '$aa_dir_path/$file': $!";
        while (<$fh>) {
            chomp;
            # only lines that start with "IEA{" are of interest
            next unless m/^IEA\{/;
            if (my ($value) = /(apples)/) {
                $log_aa{$value} = $file;
            }
        }
        close($fh);
    }
    closedir($aa_dir);

    open(my $aa_log, ">", $aa_log_path)
        or die "Can't open '$aa_log_path': $!";
    foreach (sort keys %log_aa) {
        print $aa_log $_, "->", $log_aa{$_}, "\n";
    }
    close($aa_log);

File 3 would be the actual "report" generator:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $ca_log_path = "log_ca.txt";
    my $aa_log_path = "log_aa.txt";
    my (%final_report, @ca_filenames, @aa_filenames);

    open(my $ca_fh, "<", $ca_log_path)
        or die "Can't open '$ca_log_path': $!";
    my @ca_files = <$ca_fh>;
    close($ca_fh);

    open(my $aa_fh, "<", $aa_log_path)
        or die "Can't open '$aa_log_path': $!";
    my @aa_files = <$aa_fh>;
    close($aa_fh);

    # sort arrays
    my @ca_files_sorted = sort @ca_files;
    my @aa_files_sorted = sort @aa_files;

    # keep only the filename that follows "->" on each log line
    foreach (@ca_files_sorted) {
        s/\s+\z//;    # remove all trailing whitespace
        push @ca_filenames, /->(.+)/;
    }
    foreach (@aa_files_sorted) {
        s/\s+\z//;    # remove all trailing whitespace
        push @aa_filenames, /->(.+)/;
    }

    # pair the entries up by position; this assumes both logs list the
    # same values, so stop at the shorter list to avoid undefined entries
    my $total_items = @ca_filenames < @aa_filenames
                    ? @ca_filenames
                    : @aa_filenames;
    for (1 .. $total_items) {
        $final_report{ pop @ca_filenames } = pop @aa_filenames;
    }

    print "APPLES FILE 1 => APPLES FILE 2\n";
    print '-' x 27, "\n";
    foreach (sort keys %final_report) {
        print "$_ => $final_report{$_}\n";
    }

Is this homework, by the way, dude? Anyway, that's my two cents. Run
them; if they work right away, cool. If not, that'll get you started.
There's more than one way to do it.
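
One more thought on the original 16-letter-word problem. Reading line by
line saves memory, but the 9534 hours most likely come from the nested
loop, which runs one regex match for every word against every line.
Since every word has the same length, a hash lookup can replace the inner
loop: load the words as hash keys, then slide a 16-character window
along each data line and check whether that substring is a key. Below is
a minimal sketch of that idea, not a drop-in replacement. The sub names
(load_words, find_words, search_file) and the output column order are my
own assumptions; the tab-separated word file layout is taken from the
split(/\t/) in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the word list into a hash keyed by the word.
# Assumes the word is the first tab-separated field on each line, as in
# the original split(/\t/, ...).
sub load_words {
    my ($path) = @_;
    my %words;
    open(my $fh, "<", $path) or die "Can't open '$path': $!";
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;
        my ($word) = split /\t/, $line;
        next unless defined $word && length $word;
        $words{$word} = 1;
    }
    close($fh);
    return \%words;
}

# Hypothetical helper: slide a window of $len characters along one line
# and return a [position, word] pair for every window that is a known word.
sub find_words {
    my ($line, $words, $len) = @_;
    my @hits;
    # the range is empty when the line is shorter than $len
    for my $pos (0 .. length($line) - $len) {
        my $candidate = substr($line, $pos, $len);
        push @hits, [$pos, $candidate] if exists $words->{$candidate};
    }
    return @hits;
}

# Hypothetical driver: one pass over the data file, one output line per hit
# in the form word<TAB>position<TAB>full data line.
sub search_file {
    my ($word_path, $data_path, $out_path, $len) = @_;
    my $words = load_words($word_path);
    open(my $in,  "<", $data_path) or die "Can't open '$data_path': $!";
    open(my $out, ">", $out_path)  or die "Can't open '$out_path': $!";
    while (my $line = <$in>) {
        $line =~ s/[\r\n]+\z//;
        for my $hit (find_words($line, $words, $len)) {
            my ($pos, $word) = @$hit;
            print $out "$word\t$pos\t$line\n";
        }
    }
    close($in);
    close($out) or die "Can't close '$out_path': $!";
}

# Only run when three file names are given on the command line.
search_file(@ARGV, 16) if @ARGV == 3;
```

You would run it as something like
"perl wordsearch.pl hashsequence16.txt newrawdata.txt fishersearch.txt"
(the script name is made up). Two caveats. The output is grouped by data
line rather than by word, so sort it on the first column if you need the
per-word grouping of the original. And a hash built from a 203MB word
file will need a good deal more memory than the file itself; if that
does not fit on your machine, load the word list in chunks and make one
pass over the data file per chunk.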