Re: Comparing files with regular expressions

rubinsta Fri, 02 May 2008 15:30:57 -0700

Many thanks, Chas.  These are all very helpful (and educational!)
suggestions.  I adapted your example like so (specifying the all.txt
on the command-line):


#!/usr/bin/perl
use strict;
use warnings;

open my $ex, "<", "exclude.txt" or die $!;
open my $out, ">", "exTest.txt" or die $!;

my %exists;
$exists{$_} = 1 while <$ex>;

## I changed the "unless" to "if" so I could easily
## compare the output of the script to the
## original exclude.txt file

while (my $line = <>){
    print $out $line if $exists{$line};
}

The problem is that exclude.txt and exTest.txt do not match.  Everything
in exTest.txt is also in exclude.txt, but a number of lines that appear
in both all.txt and exclude.txt do not end up in exTest.txt.  The
numbers are EANs and thus all have exactly the same format, e.g.
9780657007423.  Any thoughts as to why some of the matches are getting
missed?
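
One possible cause, offered as an assumption rather than a diagnosis:
the keys stored in %exists still carry their line terminator, so a line
ending in "\r\n" in one file and "\n" in the other, a trailing space, or
a final line with no newline at all would fail an exact-match lookup.
Below is a minimal sketch that strips trailing whitespace before storing
and comparing.  The helper name normalize and the in-memory sample data
are hypothetical; only the first EAN comes from the message above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace, including "\r" and "\n",
# so that lines differing only in their terminator compare equal.
sub normalize {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

# In-memory stand-ins for exclude.txt and all.txt (made-up sample data):
# one CRLF line, one line with a trailing space, one line with no newline.
my $exclude = "9780657007423\r\n9780000000002 \n9780000000019";
my $all     = "9780657007423\n9780000000002\n9780000000019\n9781111111113\n";

open my $ex, "<", \$exclude or die "could not open exclude data: $!";
my %exists;
while (my $line = <$ex>) {
    $exists{ normalize($line) } = 1;
}

open my $in, "<", \$all or die "could not open complete data: $!";
my @matched;
while (my $line = <$in>) {
    my $key = normalize($line);
    push @matched, $key if $exists{$key};
}

# Print every line of the complete list that is also in the exclude list.
print "$_\n" for @matched;
```

If the two real files do differ in this way, dumping a few suspect lines
with something like Data::Dumper (with $Data::Dumper::Useqq set) should
make the hidden characters visible.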

Just out of beginner curiosity, why did you suggest I use the 3
argument filehandle instead of:
open(EX, "exclude1.txt") or die $!

Thanks again for all your help!
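
On the two-argument versus three-argument question, one commonly cited
reason is that two-argument open reads the mode out of the filename
string itself, so a name that begins with ">" or ends with "|" changes
what open does.  With the three-argument form the mode is a separate
argument and the name is treated only as a name.  A small sketch,
assuming a hypothetical hostile "filename" (no file of that name is
expected to exist in the current directory):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input; with two-argument open the trailing "|" would
# cause Perl to run this as a command and read its output.
my $name = "echo pwned |";

# Three-argument open: "<" is the mode, $name is only ever a filename,
# so this just tries (and normally fails) to open a file with that name.
my $opened = open my $fh, "<", $name;

print $opened ? "opened a file literally named '$name'\n"
              : "refused: $!\n";
```

The lexical handle ($fh rather than EX) is a separate benefit: it is
scoped like any other "my" variable and is closed automatically when it
goes out of scope.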


On May 2, 7:41 am, [EMAIL PROTECTED] (Chas. Owens) wrote:
> On Thu, May 1, 2008 at 4:09 PM, rubinsta <[EMAIL PROTECTED]> wrote:
> > Hello,
>
> >  I'm a Perl uber-novice and I'm trying to compare two files in order to
> >  exclude items listed on one file from the complete list on the other
> >  file.  What I have so far prints out a third file listing everything
> >  that matches the exclude file from the complete file (which I'm hoping
> >  will be a duplicate of the exclude file) just so I can make sure that
> >  the comparison script is working.  The files are lists of numbers
> >  separated by newlines.  The exclude file has 333 numbers and the
> >  complete file has 9000 numbers.
>
> >  Here's what I have so far:
>
> >  #!/usr/bin/perl
> >  use strict;
> >  use warnings;
>
> >  open(ALL, "all.txt") or die $!;
> >  open(EX, "exclude.txt") or die $!;
> >  open(OUT,'>exTest.txt') or die $!;
>
> snip
>
> Use the three argument version of open and lexical filehandles:
>
> open my $ex, "<", "exclude.txt"
>     or die "could not open exclude.txt: $!";
>
> snip
>
> >  my @ex_lines = <EX>;
> >  my @all_lines = <ALL>;
>
> snip
>
> Using filehandles in list context is a bad idea.  It may work now when
> the files are small, but data almost always grows.  Unless you are
> certain that the file will remain small you should not do this.  Use a
> while loop instead.
>
> snip
>
>
>
> >  foreach $all (@all_lines){
> >    foreach $ex (@ex_lines){
> >        if ($ex =~ /(^$all)/){
>
> This is testing to see if there are any lines in the exclude file that
> start with what was in the complete file.  That is if the complete
> file was
>
> 1
> 2
>
> and the exclude file was
>
> 10
> 20
>
> then all lines would be excluded.  Is this really what you want?
> Also, given that you have not surrounded $all with \Q and \E (like
> /^\Q$all\E/), any metacharacters in $all (like *, ., ?, etc.) will be
> treated as metacharacters instead of normal characters.  Unless the
> lines in complete are known to be regexes this could be bad.  And by
> bad I mean everything from mismatches to the dreaded "(?{system qq(rm
> -rf $ENV{HOME})})".
>
> If you don't have regexes in the complete file but do want to check
> for its entries as prefixes in the exclude file, you are better off
> using a prefix tree (aka a trie*).  It is an O(m log n)** algorithm,
> as opposed to the O(n*m) algorithm you are using now.  There is at
> least one Perl implementation: Tree::Trie***.
>
> If you don't have regexes in the complete file and do not want to
> check for entries as prefixes in the exclude file you are better off
> using a hash set***** to test for existence (roughly an O(m+n)
> solution).  Luckily in Perl a hash set is easy to build, you just use
> a hash variable with the keys being your data and the values all being
> either undef or 1 depending on your style (I tend to use 1 for
> simplicity's sake, but I think undef might be smaller).  Using a hash
> also gives you the freedom to use something like DB_File****** if the
> files get very large (thus saving memory without having to add much
> code).
>
> snip
>
> >         print OUT $1;
> >        }
> >    }
> >  }
> >  close(ALL);
> >  close(EX);
> >  close(OUT);
>
> snip
>
> These calls to close at the end of the script are unnecessary.  Only
> call close explicitly if you need to close a file before the
> filehandle goes out of scope.
>
> Another simple tip is to treat STDIN/files on the command line as your
> complete file and STDOUT as your output file.  This form of Perl
> script is called a filter and is very easy to write and use.  What
> follows is my implementation of the hash set version:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> #this is a hack to make the script runnable
> #without external data files, in a normal
> #script you would open a real exclude file
> #here
> my $exclude = "1\n2\n3\n";
> open my $ex, "<", \$exclude
>     or die "could not open the scalar \$exclude as a file: $!";
>
> my %exists;
> $exists{$_} = 1 while <$ex>;
>
> #this is also a hack, in a normal script
> #you would say
> #while (my $line = <>) {
> #to get a loop over STDIN or files specified
> #on the commandline
> while (my $line = <DATA>) {
>         print $line unless $exists{$line};
>
> }
>
> __DATA__
> 1
> 2
> 10
> 20
>
> *http://en.wikipedia.org/wiki/Trie
> ** This is big O notation****, basically it measures the order of
> magnitude of the number of steps needed to complete the algorithm.  So, if
> you had 1,000 lines in exclude and 10,000 lines in complete it would
> take roughly 10,000,000 steps to complete the algorithm you are using
> now and only 13,287 with the trie.
> ***http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm
> ****http://en.wikipedia.org/wiki/Big_O_notation
> ***** basically a hash with no values used for testing the existence of values
> ******http://perldoc.perl.org/DB_File.html
>
> --
> Chas. Owens
> wonkden.net
> The most important skill a programmer can have is the ability to read.


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/
