Re: [google-appengine] Re: Best way to update 400,000 entities at once?

Kaan Soral Wed, 05 Mar 2014 12:34:38 -0800

3-4 years ago I also shared your enthusiasm, but it is much more
logical to just implement a scatter manually. The pre-set probability
results in a diluted scatter pool after a while, but the idea of
scatters is obviously nice.
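
A rough sketch of what "a scatter manually" might look like, in plain
Python with no datastore calls. The names (SCATTER_PROBABILITY,
new_entity, scatter_pool) and the 1% rate are assumptions made up for
illustration, not anything taken from the mapreduce library:

```python
import random

# Hypothetical sampling rate; you pick it yourself instead of relying
# on the pre-set one.
SCATTER_PROBABILITY = 0.01


def make_scatter_value(rng=random, probability=SCATTER_PROBABILITY):
    """Return a random sort value for a sampled entity, else None."""
    if rng.random() < probability:
        return rng.getrandbits(32)
    return None


def new_entity(key, rng=random, probability=SCATTER_PROBABILITY):
    """Build a dict standing in for an entity written to the datastore."""
    return {"key": key, "scatter": make_scatter_value(rng, probability)}


def scatter_pool(entities):
    """Stand-in for a keys-only query ordered by the scatter property.

    Only entities that were sampled (scatter is not None) show up,
    as (scatter, key) pairs sorted by scatter value.
    """
    return sorted(
        (e["scatter"], e["key"]) for e in entities if e["scatter"] is not None
    )
```

Since the rate is under your own control here, you can raise or lower
it to suit the size of the dataset.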


On Wednesday, March 5, 2014 10:27:46 PM UTC+2, Jeff Schnitzer wrote:
>
> Wow, I had no idea this exists. It's brilliant! 
>
> This seems to be the only formal documentation: 
>
>
> https://code.google.com/p/appengine-mapreduce/wiki/ScatterPropertyImplementation
>  
>
> Perhaps this should be added to the main GAE documentation? I never 
> would have found it. 
>
> Jeff 
>
> On Wed, Mar 5, 2014 at 7:32 AM, Lorenzo Bugiani wrote: 
> > Ok thanks! 
> > 
> > I've heard of __scatter__ now for the first time! :D 
> > 
> > 
> > 2014-03-05 16:14 GMT+01:00 Barry Hunter: 
> > 
> >> 
> >> 
> >> 
> >> On 5 March 2014 14:24, Lorenzo Bugiani wrote: 
> >>> 
> >>> I haven't understood what you can do with the __scatter__ property. 
> >>> As the MapReduce docs say, "We do not allow retrieving this value 
> >>> directly (it's stripped from the entity before it's returned to the 
> >>> application by the Datastore)"... 
> >> 
> >> 
> >> Suppose you want to break the whole dataset into N shards. 
> >> 
> >> In theory, if you just take the first N keys from a keys-only query 
> >> sorted by __scatter__, you get N keys spread evenly throughout the 
> >> whole dataset. 
> >> 
> >> Each of those keys can then be used as the 'first' key in a standard 
> >> datastore query to get all documents in that shard (using a 
> >> greater-than __key__ filter, with the results in __key__ order). It 
> >> works sort of similarly to how a cursor actually works under the hood. 
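
The splitting step described above can be sketched in plain Python. This
is only an in-memory illustration under stated assumptions:
shard_ranges and keys_in_range are hypothetical helpers, the
(scatter_value, key) pairs stand in for the result of a keys-only query
ordered by __scatter__, and taking N-1 split keys with open-ended first
and last ranges is a choice made here so that the ranges cover every
key.

```python
def shard_ranges(scatter_index, num_shards):
    """Turn sampled (scatter_value, key) pairs into key ranges.

    Takes the first num_shards - 1 keys in scatter order, sorts them by
    key, and returns (start, end) pairs where None means "unbounded".
    """
    first_by_scatter = sorted(scatter_index)[:num_shards - 1]
    splits = sorted(key for _, key in first_by_scatter)
    bounds = [None] + splits + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def keys_in_range(all_keys, start, end):
    """Stand-in for a key-ordered query with start <= key < end."""
    return [
        k for k in sorted(all_keys)
        if (start is None or k >= start) and (end is None or k < end)
    ]
```

Each (start, end) pair could then be handed to its own task, and the
tasks can run in any order since the ranges do not overlap.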
> >> 
> >> 
> >>> 
> >>> 
> >>> Also, I haven't understood what's wrong with using tasks and cursors. 
> >> 
> >> 
> >> The main 'advantage' of using __scatter__ is that you can get all the 
> >> data needed to set up ALL the tasks at once. Say you want 200 shards: 
> >> one query retrieves 200 keys, and you can immediately create all 200 
> >> tasks*. You can then begin processing those tasks in any order, even 
> >> concurrently. (Although if you have lots of shards to create, you may 
> >> use cursors for 'looping' that initial query!) 
> >> 
> >> With a 'while loop' and cursors, you have to run each query and get 
> >> all the data (keys only or the actual documents) to get the cursor 
> >> that begins the next task. Even if you do keys-only queries to get all 
> >> the cursors (and add the cursors to the tasks, so they can get all the 
> >> documents for real), you are downloading much more data than you need. 
> >> 
> >> 
> >> So with the 'nibble away' approach, you either have to be very 
> >> inefficient in creating all the initial tasks (so you can get a 
> >> progress report), or you have to create the 'next task' after running 
> >> the for-real query (i.e. after getting the next cursor from the 
> >> query), in which case you can't parallelize. 
> >> 
> >> 
> >> Both approaches have their pros and cons. One is much simpler and 
> >> easier to understand; the other is more complex, but should be more 
> >> efficient. 
> >> 
> >> 
> >> 
> >> 
> >> * Actually, the mapreduce lib does oversample to improve the results, 
> >> so it's not quite that efficient. 
> >> 
> >>> 
> >>> A task can simply iterate over the data, fetching N entities at a 
> >>> time (in this case 1000), then launch a task that can update them (by 
> >>> itself or by splitting the work again). At most, if 10 minutes aren't 
> >>> enough to iterate over all the data, this "main" task can fetch only 
> >>> a piece of the data, start the update task, then restart itself on 
> >>> the next chunk of data... 
> >> 
> >> 
> >> Yes, for small batch runs it's not going to make a big difference, but 
> >> the bigger the whole dataset to be iterated, the more any efficiency 
> >> gains have an effect. 
> >> 
> >> 
> >>> 
> >>> 
> >>> Obviously, if it is possible to split the work using only key 
> >>> information (for example, keys are numbers from 1 to 400,000), this 
> >>> works better, because fetching data by key is always better than 
> >>> querying for the same data... 
> >> 
> >> 
> >> That's what the __scatter__ does allow :) 
> >> 
> >> 
> >> -- 
> >> You received this message because you are subscribed to the Google 
> >> Groups "Google App Engine" group. 
> >> To unsubscribe from this group and stop receiving emails from it, send 
> >> an email to [email protected]. 
> >> To post to this group, send email to [email protected]. 
> >> Visit this group at http://groups.google.com/group/google-appengine. 
> >> For more options, visit https://groups.google.com/groups/opt_out. 
> > 
> > 
>

