Hi,
  Is data going to be added while this job is running? That is important to
know.

  I like something like Stephen's second idea. Change your data model to use
those four fields (or a hash of them) as the key name, then map over your
existing data rewriting it. Once you're done, there will be no dupes. If you
also change the code that creates data to do the same, you can
transactionally check for a record before creating it, or simply write the
new record -- either way will prevent duplicates. This will scale to
millions of entities with no issue, and if there are no other fields you're
trying to merge, it will also be cheaper, since you don't need to query --
you just write the new entity each time.
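
Something like this, as a rough sketch with the low-level datastore API (the
kind "Person" and the property names are just placeholders for whatever your
model uses): it builds the key name from a hash of the four fields and does
a transactional get-before-put.

import com.google.appengine.api.datastore.*;
import java.security.MessageDigest;

public class DedupeWriter {

  // Build a deterministic key name from the four fields.
  static String keyNameFor(String name, String phone, String address,
      String fatherName) throws Exception {
    String raw = name + "|" + phone + "|" + address + "|" + fatherName;
    byte[] digest = MessageDigest.getInstance("SHA-1").digest(raw.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    return hex.toString();
  }

  // Write a record only if no entity with this key exists yet.
  static void writeIfAbsent(String name, String phone, String address,
      String fatherName) throws Exception {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Key key = KeyFactory.createKey("Person",
        keyNameFor(name, phone, address, fatherName));
    Transaction txn = ds.beginTransaction();
    try {
      ds.get(txn, key);      // already there -> it's a duplicate, do nothing
      txn.rollback();
    } catch (EntityNotFoundException e) {
      Entity person = new Entity(key);
      person.setProperty("name", name);
      person.setProperty("phone", phone);
      person.setProperty("address", address);
      person.setProperty("fathername", fatherName);
      ds.put(txn, person);
      txn.commit();
    }
  }
}

If you have no other fields to merge, you can drop the transaction and just
put() the entity unconditionally -- last write wins on the same key, so there
still won't be duplicates.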

Robert


On Jul 21, 2011 12:26 PM, "Stephen Johnson" <[email protected]> wrote:
>
> Hi,
> I'll take a stab at trying to help you out. I think I understand what you
are attempting to do.
>
> Using mapreduce:
> The way I would do this is to create (if you don't already have one) a
composite index on your four fields of name, phone, address and fathername.
That said, since these are all equality-type comparisons, a zig-zag merge
join would perhaps also do what you want, but the composite index should
offer higher throughput for such a large operation, at an increased up-front
cost for the index creation.
>
> So, then, for every entity that is mapped, I would do a keys-only query on
those four fields. This should return at least one key, that of the current
entity. If multiple keys are returned, I would delete the additional
entities (there is a rough sketch of this after the two options below). I
don't think I'd add the deletes to the datastore mutation pool, since there
would be a delay before they are applied, during which those entities could
themselves get mapped and delete the entity that was just kept. That said,
there is still the possibility that another mapper is working on another
duplicate entity at the same time and deletes the entity we just kept for
dedupe purposes. There seem to be a couple of ways out of this.
>
> 1.) If the keys-only query returns only one key and it isn't the key of
the entity we are deduping, then the current entity was obviously a
duplicate that has already been deleted, so we just ignore this one.
>
> 2.) Or, whenever duplicates are found, a secondary table is used to store
the key values of the duplicated entries as a list, using the four
attributes of name, phone, address and fathername as the key. Then, when the
entire dataset has been processed, a second mapreduce would go through the
secondary table to remove the duplicate entries: for each entry, one of the
key values would be kept and the other key values would be used to delete
the duplicates.
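>
> Here's a minimal sketch of that mapper step, assuming an entity kind
"Person" with those four properties (the names are placeholders) and the
composite index described above; it does the keys-only query and the check
from option 1:

import com.google.appengine.api.datastore.*;
import com.google.appengine.api.datastore.Query.FilterOperator;
import java.util.ArrayList;
import java.util.List;

public class DedupeMapper {

  // Called once per mapped entity; deletes any other entities that share the
  // same four field values. Needs the composite index on these properties.
  public void map(Entity current) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    Query q = new Query("Person")
        .addFilter("name", FilterOperator.EQUAL, current.getProperty("name"))
        .addFilter("phone", FilterOperator.EQUAL, current.getProperty("phone"))
        .addFilter("address", FilterOperator.EQUAL, current.getProperty("address"))
        .addFilter("fathername", FilterOperator.EQUAL, current.getProperty("fathername"))
        .setKeysOnly();

    List<Key> matches = new ArrayList<Key>();
    for (Entity e : ds.prepare(q).asIterable()) {
      matches.add(e.getKey());
    }

    // Option 1: our own key is missing, so this entity was already deleted
    // as a duplicate by another mapper -- skip it.
    if (!matches.contains(current.getKey())) {
      return;
    }

    // Delete every match except the entity we are keeping.
    matches.remove(current.getKey());
    if (!matches.isEmpty()) {
      ds.delete(matches);
    }
  }
}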
>
> Perhaps a better option is using a backend server and a cursor:
> I think another way to go might be to use a backend server and a cursor.
Create the composite index described above and then query over it using a
sort order matching the composite index. Since the entries come back in sort
order, all the duplicates should be next to each other, so for each entity,
if its values match the entity before it, it is a duplicate and can be
deleted. This approach doesn't have the parallelism of the mapreduce, but it
is much simpler.
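>
> And a rough sketch of that scan, again assuming a "Person" kind with
placeholder property names; it pages through the sorted query with a cursor
and deletes any entity whose four values match the one before it:

import com.google.appengine.api.datastore.*;
import com.google.appengine.api.datastore.Query.SortDirection;

public class CursorDedupe {

  public void run() {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Sort order must match the composite index on the four fields.
    Query q = new Query("Person")
        .addSort("name", SortDirection.ASCENDING)
        .addSort("phone", SortDirection.ASCENDING)
        .addSort("address", SortDirection.ASCENDING)
        .addSort("fathername", SortDirection.ASCENDING);

    FetchOptions opts = FetchOptions.Builder.withLimit(500);
    String previous = null;

    while (true) {
      QueryResultList<Entity> batch = ds.prepare(q).asQueryResultList(opts);
      for (Entity e : batch) {
        String signature = e.getProperty("name") + "|" + e.getProperty("phone")
            + "|" + e.getProperty("address") + "|" + e.getProperty("fathername");
        if (signature.equals(previous)) {
          ds.delete(e.getKey());   // same values as the previous entity -> duplicate
        } else {
          previous = signature;    // start of a new group; keep this one
        }
      }
      if (batch.size() < 500) {
        break;                     // no more results
      }
      opts = FetchOptions.Builder.withLimit(500).startCursor(batch.getCursor());
    }
  }
}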
>
> I'm not sure if either of these suggestions helps you out, but hopefully
they might.
>
> Stephen
> Founder CortexConnect
> www.cortexconnect.com
>
>
> On Wed, Jul 20, 2011 at 8:07 PM, charming30 <[email protected]> wrote:
>>
>>
>> I have about 2 million records with about 4 string fields each,
>> which need to be checked for duplicates. To be more specific, I have
>> name, phone, address and fathername as fields, and I must check for
>> dupes using all these fields against the rest of the data. The resulting
>> unique records need to be written to the db.
>>
>> I have been able to implement mapreduce and iterate over all records.
>> Task rate is set to 100/s and bucket-size to 100. Billing is enabled.
>>
>> Currently, everything is working, but performance is very, very slow. I
>> have only been able to complete dedupe processing for 1,000 records out
>> of a test dataset of 10,000 in 6 hours.
>>
>> The current design in Java is:
>>
>> In every map iteration, I compare the current record with the previous
>> record
>> - The previous record is a single record in the db which acts like a
>> global variable that I overwrite with another previous record in each map
>> iteration
>> - Comparison is done using an algorithm and the result is written as a new
>> entity to the db
>> - At the end of one mapreduce job, I programmatically create another
>> job
>> - The previous-record variable lets the job compare the next candidate
>> record with the rest of the data
>> - I am ready to increase any amount of GAE resources to achieve this
>> in the shortest time.
>>
>> My questions are:
>>
>> - Will the accuracy of the dedupe (checking for duplicates) be affected by
>> parallel jobs/tasks?
>> - How can this design be improved?
>> - Will this scale to 20 million records?
>> - What's the fastest way to read/write variables (not just counters)
>> during map iterations that can be used across one mapreduce job?
>> - Freelancers are most welcome to assist with this.
>>
>> Thanks for your help.
>>
>
