I have about 2 million records, each with about four string fields that need to be checked for duplicates. To be more specific, the fields are name, phone, address and fathername, and I must check each record for duplicates against the rest of the data using all of these fields. The resulting unique records need to be written to the datastore.
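For concreteness, a duplicate check over all four fields would look something like the sketch below (field names as above; this assumes exact matching after trivial normalization, which may be simpler than the comparison algorithm I actually use):

import java.util.Objects;

public class RecordMatcher {

    // Returns true when two records agree on all four fields.
    // Normalization here is just trim + lower-case; the real
    // comparison algorithm may be fuzzier than this.
    public static boolean isDuplicate(String name1, String phone1, String address1, String fatherName1,
                                      String name2, String phone2, String address2, String fatherName2) {
        return eq(name1, name2)
                && eq(phone1, phone2)
                && eq(address1, address2)
                && eq(fatherName1, fatherName2);
    }

    private static boolean eq(String a, String b) {
        return Objects.equals(normalize(a), normalize(b));
    }

    private static String normalize(String s) {
        return s == null ? "" : s.trim().toLowerCase();
    }
}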
I have implemented this with MapReduce, iterating over all records. The task rate is set to 100/s and the bucket size to 100, and billing is enabled. Everything is working, but performance is very, very slow: in 6 hours I have only managed to dedupe 1,000 records out of a test dataset of 10,000.

The current design in Java is:

- In every map iteration I compare the current record with the previous record.
- The previous record is a single record in the datastore that acts like a global variable, and I overwrite it with another previous record in each map iteration.
- The comparison is done using an algorithm and the result is written as a new entity to the datastore.
- At the end of one MapReduce job I programmatically create another job.
- The previous-record variable is what lets a job compare the next candidate record against the rest of the data.
- I am ready to increase GAE resources by any amount to get this done in the shortest time.

(A rough code sketch of this per-record step is included at the end of this message.)

My questions are:

- Will the accuracy of the dedupe (duplicate checking) suffer because of parallel jobs/tasks?
- How can this design be improved?
- Will this scale to 20 million records?
- What is the fastest way to read/write variables (not just counters) during map iterations so that they can be shared across one MapReduce job?

Freelancers are most welcome to assist with this. Thanks for your help.
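Here is a rough sketch of the per-record map step described above, using the low-level datastore API (entity kinds and property names are placeholders, and the comparison algorithm is stubbed out):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class DedupeMapStep {

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    // Single entity that acts like a global "previous record" variable,
    // overwritten on every map iteration.
    private final Key previousKey = KeyFactory.createKey("DedupeState", "previousRecord");

    public void mapRecord(Entity current) {
        Entity previous = null;
        try {
            previous = datastore.get(previousKey);   // read the shared previous record
        } catch (EntityNotFoundException e) {
            // first iteration of the job: nothing to compare against yet
        }

        if (previous != null && isDuplicate(previous, current)) {
            // write the comparison result as a new entity
            Entity result = new Entity("DuplicateResult");
            result.setProperty("duplicateOf", previous.getProperty("recordKey"));
            result.setProperty("record", current.getKey());
            datastore.put(result);
        }

        // overwrite the shared previous record for the next iteration
        Entity state = new Entity(previousKey);
        state.setProperty("recordKey", current.getKey());
        state.setProperty("name", current.getProperty("name"));
        state.setProperty("phone", current.getProperty("phone"));
        state.setProperty("address", current.getProperty("address"));
        state.setProperty("fathername", current.getProperty("fathername"));
        datastore.put(state);
    }

    private boolean isDuplicate(Entity previous, Entity current) {
        // placeholder: the actual comparison algorithm plugs in here,
        // e.g. a field-by-field match over name, phone, address and fathername
        return false;
    }
}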
