Hi,
I'll take a stab at helping you out; I think I understand what you are
attempting to do.

Using mapreduce:
The way I would do this is to create (if you don't already have one) a
composite index on your four fields of name, phone, address and fathername.
Since these are all equality comparisons, a zig-zag merge join would perhaps
also do what you want, but the composite index should offer higher
throughput for such a large operation, at the cost of building the index up
front.
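
For reference, on the Java runtime that composite index would be declared in
WEB-INF/datastore-indexes.xml (or picked up by the dev server's
auto-generation). A minimal sketch, assuming your kind is called Person -
substitute your actual kind and property names:

<?xml version="1.0" encoding="utf-8"?>
<datastore-indexes autoGenerate="true">
  <!-- Composite index over the four dedupe fields (all equality filters,
       so the directions just need to be consistent). -->
  <datastore-index kind="Person" ancestor="false">
    <property name="name" direction="asc"/>
    <property name="phone" direction="asc"/>
    <property name="address" direction="asc"/>
    <property name="fathername" direction="asc"/>
  </datastore-index>
</datastore-indexes>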

So, then for every entity that is mapped, I would do a keys-only query on
those four fields. This should return at least one key: the current
entity's. If multiple entities are returned, I would delete the additional
entities. I don't think I'd add those deletes to the datastore mutation
pool, since there would be a delay before they are actually deleted; in the
meantime the duplicates could themselves get mapped, which could delete the
entity that was just kept during de-duping. With that said, there is also
the possibility that another mapper is working on another duplicate entity
at the same time and deletes the entity we just used for dedupe purposes.
There seem to be a couple of ways out of this (a rough sketch of the map
step follows the two options below).

1.) If the keys-only query returns only one key and it isn't the key of the
entity we are deduping, then the current entity was obviously itself a
duplicate that has already been deleted, so we skip it.

2.) Or, whenever duplicates are found, write the key values of the
duplicated entries as a list into a secondary table, using the 4 attributes
of name, phone, address and fathername as the key. Then, when the entire
dataset has been processed, a second mapreduce would go through the
secondary table to remove the duplicate entries: for each entry, one of the
key values would be kept and the other key values would be used to delete
the duplicate entities.
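
To make that concrete, here is a rough sketch of what the map step could
look like with the low-level datastore API, covering the keys-only query
plus the option 1 check. The Person kind and the property names are just
placeholders for whatever your model actually uses, so treat this as a
sketch rather than a drop-in implementation:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import java.util.ArrayList;
import java.util.List;

public class DedupeHelper {

  // Called from the mapper's map() with the entity currently being processed.
  public static void dedupe(Entity current) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Keys-only query over the composite index: every entity whose four
    // fields match the entity currently being mapped.
    Query q = new Query("Person")
        .addFilter("name", Query.FilterOperator.EQUAL, current.getProperty("name"))
        .addFilter("phone", Query.FilterOperator.EQUAL, current.getProperty("phone"))
        .addFilter("address", Query.FilterOperator.EQUAL, current.getProperty("address"))
        .addFilter("fathername", Query.FilterOperator.EQUAL, current.getProperty("fathername"))
        .setKeysOnly();

    List<Key> matches = new ArrayList<Key>();
    for (Entity e : ds.prepare(q).asIterable()) {
      matches.add(e.getKey());
    }

    // Option 1: a single result that isn't our key means this entity was
    // itself a duplicate that another mapper already removed, so skip it.
    if (matches.size() == 1 && !matches.get(0).equals(current.getKey())) {
      return;
    }

    // Otherwise keep the current entity and delete every other match right
    // away (not via the mutation pool, for the reasons given above).
    List<Key> toDelete = new ArrayList<Key>();
    for (Key k : matches) {
      if (!k.equals(current.getKey())) {
        toDelete.add(k);
      }
    }
    if (!toDelete.isEmpty()) {
      ds.delete(toDelete);
    }
  }
}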

Perhaps a better option is using a backend server and a cursor:
Another way to go might be to use a backend server and a cursor. Create the
composite index described above and then query over it using a sort order
matching the composite index. Since the entities come back in sorted order,
all the duplicates should be next to each other, so for each entity, if its
values match the entity before it, it is a duplicate and can be deleted
(something like the sketch below). This approach doesn't have the
parallelism of the mapreduce, but it is much simpler.
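
Here is a rough sketch of that loop, again with placeholder kind/property
names and an arbitrary batch size of 500. On a backend you would run this in
a long-lived request (or re-post the cursor to yourself) rather than all in
one go:

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;

public class SortedDedupe {

  private static final String[] FIELDS = {"name", "phone", "address", "fathername"};

  public static void run() {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Sort order matching the composite index, so duplicates are adjacent.
    Query q = new Query("Person");
    for (String f : FIELDS) {
      q.addSort(f, Query.SortDirection.ASCENDING);
    }

    Entity previous = null;
    Cursor cursor = null;
    while (true) {
      FetchOptions opts = FetchOptions.Builder.withLimit(500);
      if (cursor != null) {
        opts.startCursor(cursor);
      }
      QueryResultList<Entity> batch = ds.prepare(q).asQueryResultList(opts);
      if (batch.isEmpty()) {
        break;  // no more entities to process
      }
      for (Entity e : batch) {
        if (previous != null && sameValues(previous, e)) {
          ds.delete(e.getKey());  // duplicate of the entity before it
        } else {
          previous = e;           // new unique record becomes the comparison point
        }
      }
      cursor = batch.getCursor();
    }
  }

  private static boolean sameValues(Entity a, Entity b) {
    for (String f : FIELDS) {
      Object va = a.getProperty(f);
      Object vb = b.getProperty(f);
      if (va == null ? vb != null : !va.equals(vb)) {
        return false;
      }
    }
    return true;
  }
}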

I'm not sure if either of these suggestions helps you out, but hopefully
they might.

Stephen
Founder CortexConnect
www.cortexconnect.com


On Wed, Jul 20, 2011 at 8:07 PM, charming30 <[email protected]> wrote:

>
> I have about 2 million records with about 4 string fields each which
> need to be checked for duplicates. To be more specific, I have
> name, phone, address and fathername as fields and I must check for
> dedupe using all these fields against the rest of the data. The resulting
> unique records need to be written to the db.
>
> I have been able to implement mapreduce, iterating over all records.
> Task rate is set to 100/s and bucket-size to 100. Billing is enabled.
>
> Currently, everything is working, but performance is very, very slow. I
> have been able to complete dedupe processing for only 1000 records out of
> a test dataset of 10,000 records in 6 hours.
>
> The current design in java is:
>
> In every map iteration, I compare the current record with the previous
> record
> - The previous record is a single record in the db which acts like a
> global variable; I overwrite it with a new previous record in each map
> iteration
> - Comparison is done using an algorithm and the result is written as a
> new entity to the db
> - At the end of one Mapreduce job, I programmatically create another
> job
> - The previous record variable helps the job compare the next
> candidate record with the rest of the data
> - I am ready to increase any amount of GAE resources to achieve this
> in the shortest time.
>
> My Questions are:
>
> - Will the accuracy of dedupe (checking for duplicates) be affected by
> parallel jobs/tasks?
> - How can this design be improved?
> - Will this scale to 20 million records?
> - What's the fastest way to read/write variables (not just counters)
> during a map iteration which can be used across one mapreduce job?
> - Freelancers most welcome to assist in this.
>
> Thanks for your help.
>
