On 5 March 2014 14:24, Lorenzo Bugiani <[email protected]> wrote:
> I haven't understood what you can do with the __scatter__ property.
> As the MapReduce docs say, "*We do not allow retrieving this value
> directly (it's stripped from the entity before it's returned to the
> application by the Datastore)*"...

Suppose you want to break the whole dataset into N shards. In theory, if you just take the first N keys from a keys-only query *sorted* by __scatter__, you get N keys evenly spread throughout the whole dataset. Each of those keys can then be used as the 'first' key in a standard datastore query, to get all documents in that shard (using a greater-than __key__ filter, with the results in __key__ order). It works sort of similarly to how a cursor actually works under the hood.

> Also, I haven't understood what's wrong with using tasks and cursors.

The main 'advantage' of using __scatter__ is that you can get all the data needed to set up ALL the tasks at once. Say you want 200 shards: one query retrieving 200 keys, and you can immediately create all 200 tasks*. You can then begin processing those tasks in any order, even concurrently. (Although if you have lots of shards to create, you may want to use cursors for 'looping' that initial query!)

Using a 'while loop' and cursors, you have to run each query and get all the data (keys only or the actual documents) just to obtain the cursor needed to begin the next task. Even if you do keys-only queries to collect all the cursors (and add the cursors to the tasks, so they can fetch the documents for real), you're downloading much more data than you need.

So with the 'nibble away' approach, you either have to be very inefficient in creating all the initial tasks (so you can get a progress report), or you have to create the 'next task' only after running the for-real query (i.e. after getting the next cursor from it) - in which case you can't parallelize.

Both approaches have their pros and cons. One is much simpler and easier to understand; the other is more complex, but should be more efficient.
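To make the boundary logic concrete, here's a plain-Python sketch of how N scatter-sampled keys become N+1 half-open key ranges. It deliberately uses no App Engine APIs (in real code you'd get `split_keys` from something like a keys-only query ordered by `'__scatter__'`, and each range would become a `__key__` filter on a shard's query) - the function names here are just illustrative:

```python
def make_shard_ranges(split_keys):
    """Turn N split keys into N+1 (start, end) key ranges.

    split_keys: keys sampled via a keys-only query ordered by __scatter__.
    Scatter order is pseudo-random, so we sort them back into key order
    before using them as range boundaries. None = unbounded on that side.
    """
    split_keys = sorted(split_keys)
    bounds = [None] + split_keys + [None]
    # Adjacent pairs of boundaries form the half-open shard ranges.
    return list(zip(bounds[:-1], bounds[1:]))

def in_shard(key, start, end):
    """True if key falls in the half-open range [start, end).

    In a real shard query this would be a '__key__ >= start' plus
    '__key__ < end' filter, with results ordered by __key__.
    """
    return (start is None or key >= start) and (end is None or key < end)
```

Every key lands in exactly one range, so the shards together cover the whole dataset with no overlap, and each shard's query can run in its own task.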
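And for contrast, a minimal simulation of the serial 'nibble away' pattern - plain Python stand-ins (a list for the dataset, an integer index for the query cursor), no Datastore or Task Queue calls, all names hypothetical. The point it illustrates: the next task can't be enqueued until this batch's query has run, because the cursor only exists afterwards:

```python
BATCH = 1000  # entities fetched per task, as in the example in the thread

def run_batch(dataset, cursor):
    """Process one batch starting at `cursor`; return the cursor for the
    next batch, or None when the dataset is exhausted."""
    batch = dataset[cursor:cursor + BATCH]
    for entity in batch:
        pass  # ... update the entity ...
    next_cursor = cursor + len(batch)
    return next_cursor if next_cursor < len(dataset) else None

def nibble(dataset):
    """Serial chain: each 'task' must finish its query before the next
    one can even be created. Returns the number of batches run."""
    cursor, batches = 0, 0
    while cursor is not None:
        cursor = run_batch(dataset, cursor)
        batches += 1
    return batches
```

With the scatter approach, all the equivalents of `run_batch` could be enqueued up front and executed concurrently; here they are forced into a chain.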
* Actually the mapreduce lib oversamples to improve the results, so it's not quite that efficient.

> A task can simply iterate the data, fetching N entities at a time (in
> this case 1000), then launch a task that can update them (by itself or
> by splitting the work again). At most, if 10 minutes isn't enough to
> iterate over all the data, this "main" task can fetch only a piece of
> the data, start the update task, then restart itself on the next chunk
> of data...

Yes, for small batch runs it's not going to make a big difference, but the bigger the whole dataset to be iterated, the more any efficiency gains have an effect.

> Obviously, if it is possible to split the work using only key
> information (for example, keys are numbers from 1 to 400,000), this
> works better, because fetching data by key is always better than
> querying for the same data...

That's what __scatter__ does allow :)
