The Python MapReduce library includes a DatastoreKeyInputReader, an input reader that yields only keys rather than full entities. It looks like that is what the Datastore Admin delete handler uses to iterate over the entities:
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/delete_handler.py#148
Robert

On Mon, Nov 15, 2010 at 13:27, Stephen Johnson <[email protected]> wrote:
> Yes, I see what you're saying. Map Reduce would bring over the whole entity
> even though it isn't needed, and would consume more CPU fetching the
> entity rather than just the key. It seems like it would be nice to have an
> option for Map Reduce to hand off only keys and leave out the entity.
>
> On Sun, Nov 14, 2010 at 11:18 PM, Eli Jones <[email protected]> wrote:
>>
>> This is just an anecdotal aside (in other words, I have not bothered to do
>> any testing or comparison of performance), but I have my own utility code
>> that I use for batch deletes.
>>
>> Recently, I decided to wipe out all of the entities for one of my models,
>> but I was too lazy to look up the exact command I needed to use in the
>> remote console. So, I just used the new Datastore Admin page to delete
>> them. This page uses map reduce jobs to perform deletes.
>>
>> From what I could tell, the map reduce delete job took up several times
>> more CPU time (and wall clock time) than my custom delete job usually took.
>>
>> My usual utility class uses this method for deletes:
>>
>> 1. Create a query for all entities in a model with keys_only = True.
>> 2. Fetch 100 keys.
>> 3. Issue a deferred task to delete those 100 key names.
>> 4. Use a cursor to fetch 100 more, and issue deferred deletes until the
>>    query returns no more entities.
>>
>> This is usually pretty fast, since the only bottleneck is the time it
>> takes to fetch 100 key names and add the deferred task. The surprising fact
>> was that the default map reduce delete from the Datastore Admin page took so
>> much more CPU.
>>
>> So, if you think you'll be doing more bulk deletes in the future, it might
>> be useful to compare the CPU usage of a map reduce delete (using keys only
>> and not full entities) to a method that deletes batches of 100 key names
>> using deferred with a query cursor.
>>
>> Though, deleting 300,000 entities will take up a lot of CPU hours no
>> matter what method you use.
>>
>> Like I said, this is anecdotal and there could be no real difference in
>> performance, but the Datastore Admin delete took up way more CPU time than
>> it seemed it should have, and I didn't bother to use it or test it again.
>>
>> On Sun, Nov 14, 2010 at 11:47 PM, Erik <[email protected]> wrote:
>>>
>>> Thanks for the well-thought-out response, numbers, and reality check,
>>> Stephen! That makes a lot of sense when you consider parallel deletes
>>> and datastore CPU time.
>>>
>>> On Nov 14, 9:37 pm, Stephen Johnson <[email protected]> wrote:
>>> > Thank you for sharing your numbers with us. I think it's a good way for
>>> > all of us to get an idea of how much things cost on the cloud, so here
>>> > are my thoughts.
>>> >
>>> > Even though you had one shard executing, the shard should be doing batch
>>> > deletes and not one delete at a time. According to the documentation,
>>> > batch deletes can do up to 500 entities in one call and would execute in
>>> > parallel (perhaps not 500 all at once, but with parallelism nonetheless).
>>> > I would assume the shard would probably do about 100 or so at a time
>>> > (maybe more, maybe less).
>>> >
>>> > Anyway, a good way to prove some parallelism must be occurring would be
>>> > a proof by contradiction. So, let's assume that in fact the shard is
>>> > doing one delete at a time. Looking at the System Status, the latency of
>>> > a single delete on an entity (probably a very simple entity with no
>>> > composite indexes, which would add additional overhead) is approximately
>>> > 50ms to 100ms or so. If we assume 50ms of latency per delete (and no
>>> > overhead for mapreduce/shard maintenance, spawning additional tasks,
>>> > etc., which would add even more time), we end up with:
>>> >
>>> > 300000 entities * .05 seconds per entity = 15000 seconds
>>> > 15000 seconds / 60 seconds per minute = 250 minutes, or 4 hours 10 minutes
>>> >
>>> > Additionally, if a delete takes approximately 100 milliseconds, then
>>> > 300000 entities would take 8 hours 20 minutes to complete.
>>> > Even an unrealistic 25ms per delete is still over two hours.
>>> >
>>> > Now remember this is latency (real time) and not CPU time. So even if
>>> > something has a latency of 50ms, it could still eat up 100ms of API CPU
>>> > time: for example, 50ms to delete the entity and 50ms to update the
>>> > indexes (done in parallel). So if latency time is 4 hours 10 minutes and
>>> > we just double latency time to approximate API CPU time, we get over 8
>>> > hours of CPU time. If the average delete time for your job was 75ms,
>>> > then latency time is approximately 6 hours and CPU time 12 hours. Your
>>> > total was 11 hours of billed time, so if my logic is sound it seems
>>> > reasonable that the amount you were billed could be correct.
>>> >
>>> > Furthermore, if we look at this from another angle, we find that
>>> > if your delete job took 15 minutes to complete, then:
>>> >
>>> > 300000 entities / 15 minutes = 20000 entities per minute
>>> > 20000 entities per minute / 60 seconds per minute = 333.33 entities per second
>>> >
>>> > So, if 333.33 entities are being deleted per second serially, then the
>>> > average latency would be 3ms per delete, which seems rather unlikely.
>>> >
>>> > My thoughts. Hope it helps (and I hope my math is right),
>>> > Steve
>>> >
>>> > On Sun, Nov 14, 2010 at 2:57 PM, Erik <[email protected]> wrote:
>>> >
>>> > > On Nov 14, 1:32 pm, Stephen Johnson <[email protected]> wrote:
>>> > > > Why do you say that's silly? If your map reduce task does bulk
>>> > > > deletes and let's say they do 100 at a time, then those 100 deletes
>>> > > > are done in parallel.
>>> > > > So that's 100x. So for each second of delete real time you're
>>> > > > getting 100 seconds of CPU time. You should be pleased that, instead
>>> > > > of your task taking 11 hours to delete all your data, it took only
>>> > > > 15 minutes. Isn't that scalability? Isn't that what you're looking
>>> > > > for? How many entities did you delete? How many indexes did you have
>>> > > > (composite and single property)?
>>> > >
>>> > > This was using only 1 shard per kind that was being deleted, so
>>> > > effectively there should be no parallelism occurring, unless there is
>>> > > something I am missing?
>>> > > Deleted about 300k entities, each with a single indexed collection.
>>> > >
>>> > > > On Sun, Nov 14, 2010 at 10:29 AM, Erik <[email protected]> wrote:
>>> > > >
>>> > > > > If you check in the datastore viewer you might be able to find and
>>> > > > > delete your jobs from one of the tables. You may also need to go
>>> > > > > into your task queues and purge the default queue.
>>> > > > >
>>> > > > > On this topic, why does deleting data have such a large difference
>>> > > > > between actual time spent and billed time?
>>> > > > >
>>> > > > > For instance, I had two mapreduce shards running to delete data,
>>> > > > > which took a combined total of 15 minutes, but I was actually
>>> > > > > charged for 11(!) hours. I know there isn't a 1:1 correlation but
>>> > > > > a >40x difference is a little silly!
>>> > > > >
>>> > > > > On Nov 14, 4:25 am, Justin <[email protected]> wrote:
>>> > > > > > I've been trying to bulk delete data from my application as
>>> > > > > > described here
>>> > > > > >
>>> > > > > > http://code.google.com/appengine/docs/python/datastore/creatinggettin...
>>> > > > > >
>>> > > > > > This seems to have kicked off a series of mapreduce workers,
>>> > > > > > whose execution is killing my CPU - approximately 5 mins later
>>> > > > > > I have reached 100% CPU time and am locked out for the rest of
>>> > > > > > the day.
>>> > > > > >
>>> > > > > > I figure I'll just delete by hand; create some appropriate
>>> > > > > > :delete controllers and wait till the next day.
>>> > > > > >
>>> > > > > > Unfortunately the mapreduce process still seems to be running -
>>> > > > > > 10 past midnight and my CPU has reached 100% again.
>>> > > > > >
>>> > > > > > Is there some way to kill these processes and get back control
>>> > > > > > of my app?
>>> > > > >
>>> > > > > --
>>> > > > > You received this message because you are subscribed to the Google
>>> > > > > Groups "Google App Engine" group.
>>> > > > > To post to this group, send email to [email protected].
>>> > > > > To unsubscribe from this group, send email to [email protected].
>>> > > > > For more options, visit this group at
>>> > > > > http://groups.google.com/group/google-appengine?hl=en.
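The batch-delete method Eli describes in the quoted message (keys-only query, pages of 100 keys, one deferred delete task per page, a cursor to resume) can be sketched roughly as follows. This is a minimal illustration, not Eli's actual utility code: `InMemoryStore`, `fetch_keys_page`, `delete_all`, and the `enqueue_delete` callback are hypothetical stand-ins for a keys-only datastore query with a cursor and for a deferred delete task, so that the loop runs without the App Engine SDK.

```python
"""Sketch of cursor-based batch deletes; every name here is hypothetical."""

BATCH_SIZE = 100  # page size taken from the method described in the thread


class InMemoryStore(object):
    """Stand-in for one datastore kind; it holds keys only."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def fetch_keys_page(self, cursor, limit):
        """Mimic a keys-only query and return (keys, next_cursor).

        The cursor is the last key already returned (None for the first page).
        """
        remaining = [k for k in self.keys if cursor is None or k > cursor]
        page = remaining[:limit]
        next_cursor = page[-1] if page else cursor
        return page, next_cursor

    def delete(self, keys):
        """Mimic a batch delete by key."""
        doomed = set(keys)
        self.keys = [k for k in self.keys if k not in doomed]


def delete_all(fetch_keys_page, enqueue_delete, batch_size=BATCH_SIZE):
    """Page through keys and hand each page to a delete task.

    Returns the number of delete tasks that were issued.
    """
    cursor = None
    tasks = 0
    while True:
        keys, cursor = fetch_keys_page(cursor, batch_size)
        if not keys:  # the query returned no more keys, so we are done
            return tasks
        enqueue_delete(keys)  # on App Engine this would be a deferred task
        tasks += 1


if __name__ == "__main__":
    store = InMemoryStore(range(250))
    queue = []  # stands in for the task queue
    issued = delete_all(store.fetch_keys_page, queue.append)
    for batch in queue:  # "run" the queued delete tasks
        store.delete(batch)
    print(issued, len(store.keys))
```

The point of the design is that the fetching loop only ever touches keys and does no deleting itself, so its cost is the time to page through keys and enqueue tasks; the deletes happen later in the queued tasks.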

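Stephen's back-of-the-envelope numbers from the quoted message can be reproduced with a few lines of Python. This only checks the arithmetic under his stated assumptions (300000 entities, strictly serial deletes, API CPU time approximated as twice the latency); the function names are made up for this sketch.

```python
ENTITIES = 300000  # entity count reported in the thread


def serial_wall_clock_hours(latency_ms, entities=ENTITIES):
    """Wall-clock hours if deletes ran one at a time at the given latency."""
    return entities * (latency_ms / 1000.0) / 3600.0


def implied_latency_ms(job_minutes, entities=ENTITIES):
    """Per-delete latency (ms) a serial job needs to finish in job_minutes."""
    return job_minutes * 60.0 * 1000.0 / entities


if __name__ == "__main__":
    for latency_ms in (25, 50, 75, 100):
        wall = serial_wall_clock_hours(latency_ms)
        # Doubling latency to approximate API CPU time is Stephen's assumption.
        print("%3d ms/delete: %.2f h wall clock, ~%.2f h CPU"
              % (latency_ms, wall, 2 * wall))
    print("15-minute job implies %.1f ms per serial delete"
          % implied_latency_ms(15))
```

At 50 ms per delete this gives a little over 4 hours of wall-clock time, while a 15-minute run would need 3 ms per delete if it were serial, which is why the thread concludes the deletes must have been batched and executed in parallel.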