This can easily be extended into a general-purpose match/merge program.
Suppose we call the two inputs A and B.  Each ID falls into one of three
possible cases, so we want three subroutines, named e.g. just_in_a,
just_in_b, and in_both.  (In the original example just_in_a would do
the same thing as just_in_b, but that is not always what is desired.)

I am looking for Perl code that does this in a configurable way, e.g.
let the user specify the ID column(s), sort the two inputs (if not
already sorted), read them both, call the subs, etc.  Please send a link
or the code itself.
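Something along these lines is what I have in mind (only a sketch; the
comma separator, the ID in the first column, and string comparison of
IDs are all assumptions, not a finished program):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a general match/merge driver.  Both filehandles must already
# be sorted on the ID.  The three callbacks are the just_in_a, just_in_b,
# and in_both subs described above.
sub match_merge {
    my ( $fh_a, $fh_b, %cb ) = @_;
    my $read = sub {
        my ($fh) = @_;
        my $line = <$fh>;
        return unless defined $line;    # undef record means EOF
        chomp $line;
        my ($id) = split /,/, $line;    # ID assumed in the first CSV column
        return [ $id, $line ];
    };
    my $ra = $read->($fh_a);
    my $rb = $read->($fh_b);
    while ( defined $ra or defined $rb ) {
        if ( !defined $rb or ( defined $ra and $ra->[0] lt $rb->[0] ) ) {
            $cb{just_in_a}->( $ra->[1] );
            $ra = $read->($fh_a);
        }
        elsif ( !defined $ra or $rb->[0] lt $ra->[0] ) {
            $cb{just_in_b}->( $rb->[1] );
            $rb = $read->($fh_b);
        }
        else {                          # same ID present in both inputs
            $cb{in_both}->( $ra->[1], $rb->[1] );
            $ra = $read->($fh_a);
            $rb = $read->($fh_b);
        }
    }
}
```

Making the ID column(s) and separator configurable, and sorting
unsorted input first, would layer on top of this.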

thanks,
Steve
-- 
Steven Tolkin    Steve-d0t-Tolkin-at-fmr-d0t-com     508-787-9006
Fidelity Investments   400 Puritan Way M3B     Marlborough MA 01752
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
John Macdonald
Sent: Monday, August 27, 2007 4:01 PM
To: Alex Brelsfoard
Cc: [email protected]
Subject: Re: [Boston.pm] merge and compare help

Your solution is the right one.  The final trick is to make
sure you keep going with one file after the other file reaches
the end.  I usually have the file read routine return a fake
record for EOF that has a key guaranteed to sort higher than
any real key.  (That requires knowing what the keys look like,
but it will often be something like "\xFF\xFF\xFF\xFF".)  The
merge subroutine checks for that EOF key and exits.  If a merge
is done for a different key, then neither file can be at EOF.
If a record is written without needing a merge, then that file
at least is not at EOF.  This trick gets rid of a lot of code
that checks whether either or both files are at EOF when you
are deciding whether to read from a file and when comparing the
current records.
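A Perl sketch of that sentinel approach (the comma-separated layout
with the key in the first column, and the "|" merge of matching lines,
are assumptions for illustration):

```perl
use strict;
use warnings;

# Sentinel key that sorts after any real key under string comparison.
# Assumes real keys never contain 0xFF bytes.
my $EOF_KEY = "\xFF\xFF\xFF\xFF";

# Read one record; at EOF return a fake record carrying the sentinel
# key, so the merge loop never has to track EOF separately.
sub read_rec {
    my ($fh) = @_;
    my $line = <$fh>;
    return [ $EOF_KEY, undef ] unless defined $line;
    chomp $line;
    my ($key) = split /,/, $line;       # key assumed in first CSV column
    return [ $key, $line ];
}

sub merge_loop {
    my ( $fh_a, $fh_b, $out ) = @_;
    my $ra = read_rec($fh_a);
    my $rb = read_rec($fh_b);
    until ( $ra->[0] eq $EOF_KEY and $rb->[0] eq $EOF_KEY ) {
        if ( $ra->[0] eq $rb->[0] ) {       # same ID in both: merge lines
            print {$out} "$ra->[1]|$rb->[1]\n";
            $ra = read_rec($fh_a);
            $rb = read_rec($fh_b);
        }
        elsif ( $ra->[0] lt $rb->[0] ) {    # only in A; no EOF test needed,
            print {$out} "$ra->[1]\n";      # the sentinel never sorts lower
            $ra = read_rec($fh_a);
        }
        else {                              # only in B
            print {$out} "$rb->[1]\n";
            $rb = read_rec($fh_b);
        }
    }
}
```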

On Mon, Aug 27, 2007 at 02:04:57PM -0400, Alex Brelsfoard wrote:
> Hi All,
> 
> I'm back and with a new algorithm/solution I need help with.
> I have two csv files, sorted by the first column (ID).
> Each file may have all the same, none of the same, or some of the same IDs.
> I would like to take these two files, and make one out of them.
> Two tricks:
>  - When I come across the same ID in each file I need to merge those
>    two lines (don't worry about the merge, I can handle that).
>  - I want to be looking at as few lines from each file as possible at
>    any one time (optimally I would like to only be looking at one line
>    of each file at the same time).
> 
> Basically we are dealing with large files here and I don't want to
> kill my RAM by storing all the data from both files into a hash or
> some other object.
> 
> I have an algorithm I like, I'm just not certain how to implement it:
> 1. Examine the ID of the first line of each file.
> 2. If they are the same, merge them and print the result to the final
>    output file.
> 3. If they are not the same, take the lesser one and print its lines to
>    the final output file until its ID is equal to or greater than the
>    other file's.
> 4. repeat.
> 
> Any advice on how to do this?
> 
> Thanks.
> --Alex
>  
> _______________________________________________
> Boston-pm mailing list
> [email protected]
> http://mail.pm.org/mailman/listinfo/boston-pm
 

 