Kumar, thank you. That is exactly the solution to my problem as I formulated it, and it stands. However, I should have added that each of the two logs has timestamps, and that we are looking for missing records whose timestamps are reasonably close to each other.
I have come up with a solution where I use the rounded time as the key, and then in the reducer sort all records that fall within that rounded time. After that I am free to find the missing ones, or anything else I want to know about them. What do you think?

Sincerely,
Mark

On Sun, Jun 26, 2011 at 12:34 AM, Kumar Kandasami <[email protected]> wrote:

> Mark -
>
> A thought around accomplishing this as a MapReduce job: if you could add
> the datasource information in the mapper phase, with the record id as the
> key, then in the reducer phase you can look for record ids with a missing
> datasource and print the record id.
>
> Driver Code:
>
>     MultipleInputs.addInputPath(conf, log1path, InputFormat, Log1Mapper);
>     MultipleInputs.addInputPath(conf, log2path, InputFormat, Log2Mapper);
>
> Mapper Phase -
>
>     Output - Key: record id. Value: contains the datasource in
>              addition to other values.
>     Logic - add the datasource information to the record.
>
> Reduce Phase -
>
>     Output - print the record id that does not have a log2 or log1
>              datasource value.
>     Logic - add to the output only records that do not have a log1 or
>             log2 datasource.
>
> Kumar _/|\_
>
> On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner <[email protected]> wrote:
>
> > Hi,
> >
> > I have two logs which should have all the records for the same record_id;
> > in other words, if this record_id is found in the first log, it should
> > also be found in the second one. However, I suspect that the second log
> > is filtered out, and I need to find the missing records. Anything is
> > allowed: MapReduce job, Hive, Pig, and even a NoSQL database.
> >
> > Thank you.
> >
> > It is also a good time to express my thanks to all the members of the
> > group who are always very helpful.
> >
> > Sincerely,
> > Mark
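
For reference, the rounded-time approach described at the top of this message can be sketched roughly as follows. This is a plain Python simulation of the map and reduce steps, not an actual Hadoop job. The 60-second bucket width, the `(record_id, timestamp)` record layout, and all function names are assumptions made for illustration; none of them come from the thread.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width; the thread does not specify one


def map_record(source, record_id, timestamp):
    """Map step: key each record by its timestamp rounded down to a bucket."""
    bucket = int(timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
    # The value carries the datasource tag, as in Kumar's suggestion.
    return bucket, (timestamp, source, record_id)


def reduce_bucket(values):
    """Reduce step: sort one bucket by time, return log1 ids absent from log2."""
    ordered = sorted(values)
    seen_in_log2 = {rid for _, src, rid in ordered if src == "log2"}
    return [rid for _, src, rid in ordered
            if src == "log1" and rid not in seen_in_log2]


def find_missing(log1, log2):
    """Simulate the shuffle: group mapped records by bucket, then reduce each."""
    buckets = defaultdict(list)
    for source, log in (("log1", log1), ("log2", log2)):
        for record_id, timestamp in log:
            key, value = map_record(source, record_id, timestamp)
            buckets[key].append(value)
    missing = []
    for key in sorted(buckets):
        missing.extend(reduce_bucket(buckets[key]))
    return missing
```

One caveat with this sketch: a record logged at second 59 in the first log and at second 61 in the second log lands in two different buckets and is reported as missing. Emitting each record into its neighboring bucket as well, or choosing a bucket much wider than the expected time skew, would be one way to reduce such false positives.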
