On Mar 26, 12:27 pm, [EMAIL PROTECTED] (Tom Phoenix) wrote:
> On Wed, Mar 26, 2008 at 8:18 AM, <[EMAIL PROTECTED]> wrote:
>
> > I have two sorted files (one string per line).
> > [I'd also like to know how to solve this if the lists weren't sorted
> > (as complemented sets).]
> > I want to output the List1 items not found in the List2 file.
> > grep is too slow.
> > diff gets stuck because list2 has millions of items.
>
> If the lists aren't sorted, it's probably best to read the second list
> (the list of filters) into a hash. But since they're sorted, and
> because you have many filters, it's more efficient to read the files
> in parallel.
>
> My first draft of this program used this line to implement the inner loop:
>
>     $current_filter = <FILTERS> while $item gt $current_filter;
>
> ... but then I realized that the second file could run out of filters
> before the first one runs out of data, so it had to become more
> complex:
>
>     #!/usr/bin/perl
>
>     use strict;
>     use warnings;
>
>     die "huh?" unless @ARGV == 2;
>     my($data_file, $filters) = @ARGV;
>
>     open DATA_FILE, $data_file or die "Can't read '$data_file': $!";
>     open FILTERS, $filters or die "Can't read '$filters': $!";
>
>     my $current_filter = '';
>
>     # outer loop reads a line at a time
>     DATA_LINE:
>     while (my $item = <DATA_FILE>) {
>
>       # inner loop updates the filter, if needed
>       # This inner loop would be just this line:
>       ###  $current_filter = <FILTERS> while $item gt $current_filter;
>       # ... except that we have to allow for the filters to run out.
>       while ($item gt $current_filter) {
>         if (defined($current_filter = <FILTERS>)) {
>           # a filter was read from the file: normal case
>         } else {
>           # No more filters; print everything else
>           print $item;
>           print while <DATA_FILE>;
>           last DATA_LINE;
>         }
>       }
>
>       # the inner loop has now updated $current_filter
>       print $item unless $item eq $current_filter;
>     }
>
> Hope this helps!
>
> --Tom Phoenix
> Stonehenge Perl Training

Works great, thanks.

One more thing, if I may: how do I modify the code so that it works as is with two args (perlscr list1.txt list2.txt) but reads the data file from stdin when only one arg is given (cat list1.txt | perlscr list2.txt)?

Thanks again
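
One way the stdin question could be handled, as a rough sketch rather than a definitive answer: treat the last argument as the filter list, and fall back to STDIN for the data when no other argument is given. The sub names print_unfiltered and main are hypothetical, and the lexical filehandles replace the barewords DATA_FILE and FILTERS from the quoted script so that STDIN can be passed in like any other handle. It assumes, as the quoted script does, that both inputs are sorted in plain string order.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Print every line of $data_fh that does not appear in $filter_fh.
# Both handles must yield lines sorted in the same (string) order.
sub print_unfiltered {
    my ($data_fh, $filter_fh, $out_fh) = @_;
    my $current_filter = '';

    while (my $item = <$data_fh>) {
        # Advance the filter until it catches up with the item.
        # $current_filter becomes undef once the filters run out.
        while (defined $current_filter and $item gt $current_filter) {
            $current_filter = <$filter_fh>;
        }
        print {$out_fh} $item
            unless defined $current_filter and $item eq $current_filter;
    }
}

sub main {
    my @args = @_;
    die "usage: $0 [data_file] filter_file\n"
        unless @args == 1 or @args == 2;

    # The filter list is always the last argument.
    my $filters = pop @args;
    open my $filter_fh, '<', $filters
        or die "Can't read '$filters': $!";

    # With a remaining argument, read the data from that file;
    # otherwise read it from standard input.
    my $data_fh;
    if (@args) {
        open $data_fh, '<', $args[0]
            or die "Can't read '$args[0]': $!";
    } else {
        $data_fh = \*STDIN;
    }

    print_unfiltered($data_fh, $filter_fh, \*STDOUT);
}

# Skipped when there are no arguments at all, so the file can also be
# loaded just for print_unfiltered().
main(@ARGV) if @ARGV;
```

Both invocations from the question should then work: perlscr list1.txt list2.txt, and cat list1.txt | perlscr list2.txt. A shorter variant would be to pop the filter file off @ARGV and read the data with the <> operator, which reads the files left in @ARGV or falls back to STDIN when @ARGV is empty.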