I have a semi-big mess here.

Never mind how I managed it, but let me describe what I want to do and
maybe some kind soul can steer me to something on CPAN or just offer
some helpful clues.

I have two directories of news/mail messages containing different
numbers of files. However, I expect there is file overlap in both
directions.

That is, some messages appear in both directories.

It gets even more complicated in that there may actually be different
headers present but the message body will be identical.

By different headers I mean 1 or 2, like a differing Xref:
header. Or possibly the Subject:, From: or Newsgroups: headers.

That may even be true inside a single directory.  That is, there may be
duplicate bodies in the same directory too.

All I can think of for coding is to ignore the headers completely and
compare only the bodies.  
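
A rough sketch of that header-stripping step, assuming each file holds
one message in the usual layout (headers, one empty line, then the
body). The name body_of is made up for illustration:

```perl
use strict;
use warnings;

# Return the body of a message: everything after the first empty line.
# Assumption: headers and body are separated by one empty line.
sub body_of {
    my ($message) = @_;
    # Split once at the first empty line (LF or CRLF line endings).
    my (undef, $body) = split /\r?\n\r?\n/, $message, 2;
    # A message with no empty line has no body at all.
    return defined $body ? $body : '';
}

my $msg = "Subject: test\nXref: host some.group:1\n\nHello\nworld\n";
print body_of($msg);
```

Two messages that differ only in their headers would then give the
same string from body_of.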

But even doing that sounds like a fairly complicated undertaking.

Seems like every single message (body) would need to be compared first
to every other in its own directory and then to all in the other
directory.

At least I expect the bodies would be identical, not just close.

I can probably manage the coding.. it would be sloppy and primitive
but I can probably do it with maybe a little help.

What I'm asking here is not the coding so much, although that is
welcome too, but really a general plan of how to go at this.

My first thought was to hold each body in turn in an array and then,
while holding it, read the other bodies into arrays one by one and
compare each as I go along.

Or maybe something with hashes, since they have that handy property of
identical keys collapsing into a single entry.
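
The hash idea might look something like the sketch below. It assumes
the bodies are already in memory as byte strings keyed by file name,
and it uses a digest from the core Digest::MD5 module as the hash key
so the keys stay short. The name group_by_body and the sample file
names are invented for the example:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Group message names by a digest of their body.
# $bodies is a hash ref mapping a file name to that message's body.
sub group_by_body {
    my ($bodies) = @_;
    my %seen;    # digest => [ names of files sharing that body ]
    for my $name (sort keys %$bodies) {
        push @{ $seen{ md5_hex($bodies->{$name}) } }, $name;
    }
    # Keep only the groups with more than one member: the duplicates.
    return grep { @$_ > 1 } values %seen;
}

# Hypothetical sample data standing in for two directories.
my %bodies = (
    'dir1/101' => "Hello\nworld\n",
    'dir1/102' => "Something else\n",
    'dir2/7'   => "Hello\nworld\n",
);
for my $group (group_by_body(\%bodies)) {
    print join(' ', @$group), "\n";
}
```

Done this way each body is looked at once, so nothing has to be
compared against every other message, and duplicates inside one
directory fall out the same way as duplicates across directories.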

Then again it might be quicker to slurp a body as a string and use a
uniqifier like:

    if ($data{$_}++ == 0) {
      # first time this body has been seen: do something with $_
    }

But that might have problems with differing numbers of blank lines...
Although I do expect the bodies to be identical, I'm not really 100%
sure about that either.  I may have to include some normalizing code to
chomp the lines and remove any blanks so the number of blank lines
isn't considered ...
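
Something like this might do for that normalizing step. It is only a
sketch: the name normalize_body is made up, and stripping trailing
whitespace along with the blank lines is my own guess at what else
could differ:

```perl
use strict;
use warnings;

# Reduce a body to a canonical form so that trivial whitespace
# differences do not hide a duplicate.
sub normalize_body {
    my ($body) = @_;
    my @lines = split /\r?\n/, $body;
    s/\s+$// for @lines;                # strip trailing whitespace
    @lines = grep { length } @lines;    # drop blank lines entirely
    return join "\n", @lines;
}

my $one = "Hello\n\n\nworld\n";
my $two = "Hello  \nworld\n\n";
print "same after normalizing\n"
    if normalize_body($one) eq normalize_body($two);
```

The normalized string, rather than the raw body, would then be what
gets compared or used as the hash key.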

Any guidance gratefully accepted..



