Re: [google-appengine] Re: Bulk data deletion woe

Robert Kluin Mon, 15 Nov 2010 10:47:13 -0800

In the Python MR libs, there is a DatastoreKeyInputReader input
reader.  It looks like that is what's used to iterate over the
entities.
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/delete_handler.py#148

Robert

On Mon, Nov 15, 2010 at 13:27, Stephen Johnson <[email protected]> wrote:
> Yes, I see what you're saying. Map Reduce would bring over the whole entity
> even though it isn't needed, and would consume more CPU fetching the
> entity rather than just the key. It would be nice to have an
> option for Map Reduce to hand off only keys and leave out the entity.
>
> On Sun, Nov 14, 2010 at 11:18 PM, Eli Jones <[email protected]> wrote:
>>
>> This is just an anecdotal aside (in other words, I have not bothered to do
>> any testing or comparison of performance).. but.. I have my own utility code
>> that I use for batch deletes.
>> Recently, I decided to wipe out all of the entities for one of my models,
>> but I was too lazy to look up the exact command I needed to use in the
>> remote console.
>> So, I just used the new Datastore Admin page to delete them.  This page
>> uses map reduce jobs to perform deletes.
>> From what I could tell, the map reduce delete job took up several times
>> more CPU time (and wall clock time) than my custom delete job usually took.
>> My usual utility class uses this method for deletes:
>> 1. Create a query for all entities in a model with keys_only = True.
>> 2. Fetch 100 keys.
>> 3. Issue a deferred task to delete those 100 keys.
>> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
>> query returns no more keys.
>> This is usually pretty fast, since the only bottleneck is the time it
>> takes to fetch 100 keys and add the deferred task.  The surprising fact
>> was that the default map reduce delete from the Datastore Admin page took so
>> much more CPU.
>> So, if you think you'll be doing more bulk deletes in the future, it might
>> be useful to compare the CPU usage of a map reduce delete (using keys only
>> and not full entities) to a method that deletes batches of 100 keys
>> using deferred with a query cursor.
>> Though, deleting 300,000 entities will take up a lot of CPU hours no
>> matter what method you use.
>> Like I said.. this is anecdotal and there could be no real difference in
>> performance.. but the Datastore Admin delete took up way more CPU time than
>> it seemed it should have, and I didn't bother to use it or test it again.
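The batch-delete loop described above can be sketched roughly as follows. This is not Eli's actual utility code: `delete_in_batches`, `fetch_keys`, `enqueue_delete`, and the in-memory `make_fake_store` stand-in are hypothetical names used for illustration, so the loop can run without the App Engine SDK. The comments note which Python `db` and `deferred` calls would plausibly fill each role; treat that mapping as an assumption.

```python
def delete_in_batches(fetch_keys, enqueue_delete, batch_size=100):
    """Page through a keys-only query and enqueue one delete task per page.

    fetch_keys(cursor, limit) returns (keys, next_cursor).
    enqueue_delete(keys) schedules deletion of that page of keys.
    Returns the total number of keys handed off.
    """
    cursor = None
    total = 0
    while True:
        # On App Engine (assumption): q = MyModel.all(keys_only=True);
        # q.with_cursor(cursor) if a cursor is set; keys = q.fetch(limit);
        # next_cursor = q.cursor().
        keys, cursor = fetch_keys(cursor, batch_size)
        if not keys:
            break  # the query returned nothing more, so we are done
        # On App Engine (assumption): deferred.defer(db.delete, keys)
        enqueue_delete(keys)
        total += len(keys)
    return total


def make_fake_store(count):
    """In-memory stand-in for the datastore, for demonstration only."""
    store = set(range(count))
    ordered = sorted(store)  # fixed snapshot so deletes do not shift pages

    def fetch_keys(cursor, limit):
        start = cursor or 0
        page = ordered[start:start + limit]
        return page, start + len(page)

    def enqueue_delete(keys):
        store.difference_update(keys)  # a real task would run later

    return store, fetch_keys, enqueue_delete


store, fetch_keys, enqueue_delete = make_fake_store(250)
handed_off = delete_in_batches(fetch_keys, enqueue_delete)
print(handed_off, len(store))
```

Passing the fetch and enqueue steps in as callables keeps the loop itself independent of the datastore, which also makes it easy to time against a map reduce delete.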
>>
>> On Sun, Nov 14, 2010 at 11:47 PM, Erik <[email protected]> wrote:
>>>
>>> Thanks for the well-thought-out response, the numbers, and the reality
>>> check, Stephen!  That makes a lot of sense when you consider parallel
>>> deletes and datastore CPU time.
>>>
>>> On Nov 14, 9:37 pm, Stephen Johnson <[email protected]> wrote:
>>> > Thank you for sharing your numbers with us. I think it's a good way for
>>> > all of us to get an idea of how much things cost on the cloud, so here
>>> > are my thoughts.
>>> >
>>> > Even though you had one shard executing, the shard should be doing batch
>>> > deletes and not one delete at a time. According to the documentation, a
>>> > batch delete can handle up to 500 entities in one call and would execute
>>> > in parallel (perhaps not all 500 at once, but with parallelism
>>> > nonetheless). I would assume the shard does about 100 or so at a time
>>> > (maybe more, maybe less).
>>> >
>>> > Anyway, a good way to show that some parallelism must be occurring is a
>>> > proof by contradiction. So, let's assume that the shard is in fact doing
>>> > one delete at a time. Looking at the System Status, the latency of a
>>> > single delete on an entity (probably a very simple entity with no
>>> > composite indexes, which would add overhead) is approximately 50ms to
>>> > 100ms. If we assume 50ms of latency per delete, and no overhead for
>>> > mapreduce/shard maintenance, spawning additional tasks, etc. (which
>>> > would add even more time), we end up with:
>>> >
>>> >     300000 entities * .05 seconds per entity = 15000 seconds
>>> >     15000 seconds / 60 seconds per minute = 250 minutes or 4 hours 10 minutes
>>> >
>>> > Additionally, if a delete takes approximately 100 milliseconds, then
>>> > 300000 entities would take 8 hours 20 minutes to complete.
>>> > Even an unrealistic 25ms per delete is still over two hours.
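The serial estimates above are easy to reproduce. The short sketch below recomputes them for the three latencies mentioned, assuming strictly one delete at a time and no other overhead; `serial_delete_time` is a hypothetical helper name, not part of any App Engine API.

```python
def serial_delete_time(entities, latency_ms):
    """Return (hours, minutes) needed to delete entities one at a time."""
    total_seconds = entities * latency_ms // 1000
    total_minutes = total_seconds // 60
    return divmod(total_minutes, 60)


# The three per-delete latencies discussed in the message.
for latency_ms in (25, 50, 100):
    hours, minutes = serial_delete_time(300000, latency_ms)
    print("%3d ms per delete: %d h %02d min" % (latency_ms, hours, minutes))
```

For 300000 entities this gives 2 h 05 min at 25ms, 4 h 10 min at 50ms, and 8 h 20 min at 100ms, matching the figures in the message.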
>>> >
>>> > Now remember, this is latency (real time) and not CPU time. So even if
>>> > something has a latency of 50ms, it could still eat up 100ms of API CPU
>>> > time: for example, 50ms to delete the entity and 50ms to update the
>>> > indexes (done in parallel). So if the latency is 4 hours 10 minutes and
>>> > we just double it to approximate API CPU time, we get over 8 hours of
>>> > CPU time. If the average delete time for your job was 75ms, then the
>>> > latency is approximately 6 hours and the CPU time 12 hours. Your total
>>> > was 11 hours of billed time, so if my logic is sound, the amount you
>>> > were billed seems reasonable.
>>> >
>>> > Furthermore, if we look at this from another angle, we find that if
>>> > your delete job took 15 minutes to complete, then:
>>> >
>>> >     300000 entities / 15 minutes = 20000 entities per minute
>>> >     20000 entities per minute / 60 seconds per minute = 333.33 entities per second
>>> >
>>> > So, if 333.33 entities are being deleted per second serially, then the
>>> > average latency would be 3ms per delete, which seems rather unlikely.
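The same check can be run in the other direction: start from the observed wall-clock time and work out what a purely serial job would imply. The sketch below uses only the numbers quoted in this thread (300000 entities, 15 minutes of wall time, 11 hours billed); the function names are hypothetical.

```python
def deletes_per_second(entities, wall_minutes):
    """Observed throughput over the whole job."""
    return entities / (wall_minutes * 60.0)


def implied_serial_latency_ms(entities, wall_minutes):
    """Latency each delete would need if they ran strictly one at a time."""
    return wall_minutes * 60 * 1000.0 / entities


def billed_to_wall_ratio(billed_hours, wall_minutes):
    """How many seconds of CPU were billed per second of wall time."""
    return billed_hours * 60.0 / wall_minutes


print("%.2f deletes per second" % deletes_per_second(300000, 15))
print("%.1f ms per delete if serial" % implied_serial_latency_ms(300000, 15))
print("%.0fx billed CPU time versus wall time" % billed_to_wall_ratio(11, 15))
```

A 3ms serial latency is far below the 50ms to 100ms quoted earlier, and the 44x ratio is in line with the ">40x" Erik reported, so both point toward the deletes being batched and executed in parallel.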
>>> >
>>> > My thoughts. Hope it helps (and I hope my math is right),
>>> > Steve
>>> >
>>> > On Sun, Nov 14, 2010 at 2:57 PM, Erik <[email protected]> wrote:
>>> >
>>> > > On Nov 14, 1:32 pm, Stephen Johnson <[email protected]> wrote:
>>> > > > Why do you say that's silly? If your map reduce task does bulk
>>> > > > deletes, and let's say it does 100 at a time, then those 100
>>> > > > deletes are done in parallel. So that's 100x. So for each second of
>>> > > > real delete time you're getting 100 seconds of CPU time.  You should
>>> > > > be pleased that instead of your task taking 11 hours to delete all
>>> > > > your data it took only 15 minutes. Isn't that scalability? Isn't
>>> > > > that what you're looking for? How many entities did you delete? How
>>> > > > many indexes did you have (composite and single property)?
>>> >
>>> > > This was using only 1 shard per kind that was being deleted, so
>>> > > effectively there should be no parallelism occurring, unless there is
>>> > > something I am missing?
>>> > > Deleted about 300k entities, each with a single indexed collection.
>>> >
>>> > > > On Sun, Nov 14, 2010 at 10:29 AM, Erik <[email protected]> wrote:
>>> >
>>> > > > > If you check in the datastore viewer you might be able to find and
>>> > > > > delete your jobs from one of the tables.  You may also need to go
>>> > > > > into your task queues and purge the default queue.
>>> >
>>> > > > > On this topic, why does deleting data have such a large difference
>>> > > > > between actual time spent and billed time?
>>> >
>>> > > > > For instance, I had two mapreduce shards running to delete data,
>>> > > > > which took a combined total of 15 minutes, but I was actually
>>> > > > > charged for 11(!) hours.  I know there isn't a 1:1 correlation but
>>> > > > > a >40x difference is a little silly!
>>> >
>>> > > > > On Nov 14, 4:25 am, Justin <[email protected]> wrote:
>>> > > > > > I've been trying to bulk delete data from my application as
>>> > > > > > described here
>>> >
>>> >
>>> > > >http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>>> >
>>> > > > > > This seems to have kicked off a series of mapreduce workers,
>>> > > > > > whose execution is killing my CPU - approximately 5 mins later I
>>> > > > > > have reached 100% CPU time and am locked out for the rest of the
>>> > > > > > day.
>>> >
>>> > > > > > I figure I'll just delete by hand; create some appropriate
>>> > > > > > :delete controllers and wait till the next day.
>>> >
>>> > > > > > Unfortunately the mapreduce process still seems to be running -
>>> > > > > > 10 past midnight and my CPU has reached 100% again.
>>> >
>>> > > > > > Is there some way to kill these processes and get back control
>>> > > > > > of my app?
>>> >
>>> > > > > --
>>> > > > > You received this message because you are subscribed to the Google
>>> > > > > Groups "Google App Engine" group.
>>> > > > > To post to this group, send email to [email protected].
>>> > > > > To unsubscribe from this group, send email to
>>> > > > > [email protected].
>>> > > > > For more options, visit this group at
>>> > > > > http://groups.google.com/group/google-appengine?hl=en.
>>> >
>>>
>>>
>>
>
>

