On 5 March 2014 14:24, Lorenzo Bugiani <[email protected]> wrote:
> I haven't understood what you can do with the __scatter__ property.
> As the MapReduce docs say, "*We do not allow retrieving this value
> directly (it's stripped from the entity before it's returned to the
> application by the Datastore)*"...

Suppose you want to break the whole dataset into N shards. In theory, if you just take the first N keys from a keys-only query *sorted* by __scatter__, you get N keys evenly spread throughout the whole dataset. Each of those keys can then be used as the 'first' key in a standard datastore query, to get all documents in that shard (using a greater-than __key__ filter, with the results in __key__ order). It works sort of similarly to how a cursor actually works under the hood.

> Also, I haven't understood what's wrong with using tasks and cursors.

The main 'advantage' of using __scatter__ is that you can get all the data needed to set up ALL the tasks at once. Say you want 200 shards: one query retrieving 200 keys, and you can immediately create all 200 tasks*. You can then begin processing those tasks in any order, even concurrently. (Although if you have lots of shards to create, you may want to use cursors for 'looping' that initial query!)

Using a 'while loop' and cursors, you have to run each query and get all the data (keys only or the actual documents) just to obtain the cursor needed to begin the next task. Even if you do keys-only queries to collect all the cursors (and add the cursors to the tasks, so they can fetch the documents for real), you're downloading much more data than you need.

So with the 'nibble away' approach, you either have to be very inefficient in creating all the initial tasks (so you can get a progress report), or you have to create the 'next task' only after running the for-real query (i.e. after getting the next cursor from it) - in which case you can't parallelize.

Both approaches have their pros and cons. One is much simpler and easier to understand; the other is more complex, but should be more efficient.
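To make the boundary logic concrete, here's a plain-Python sketch of how N scatter-sampled keys become N+1 half-open key ranges. It deliberately uses no App Engine APIs (in real code you'd get `split_keys` from something like a keys-only query ordered by `'__scatter__'`, and each range would become a `__key__` filter on a shard's query) - the function names here are just illustrative:

```python
def make_shard_ranges(split_keys):
    """Turn N split keys into N+1 (start, end) key ranges.

    split_keys: keys sampled via a keys-only query ordered by __scatter__.
    Scatter order is pseudo-random, so we sort them back into key order
    before using them as range boundaries. None = unbounded on that side.
    """
    split_keys = sorted(split_keys)
    bounds = [None] + split_keys + [None]
    # Adjacent pairs of boundaries form the half-open shard ranges.
    return list(zip(bounds[:-1], bounds[1:]))

def in_shard(key, start, end):
    """True if key falls in the half-open range [start, end).

    In a real shard query this would be a '__key__ >= start' plus
    '__key__ < end' filter, with results ordered by __key__.
    """
    return (start is None or key >= start) and (end is None or key < end)
```

Every key lands in exactly one range, so the shards together cover the whole dataset with no overlap, and each shard's query can run in its own task.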
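And for contrast, a minimal simulation of the serial 'nibble away' pattern - plain Python stand-ins (a list for the dataset, an integer index for the query cursor), no Datastore or Task Queue calls, all names hypothetical. The point it illustrates: the next task can't be enqueued until this batch's query has run, because the cursor only exists afterwards:

```python
BATCH = 1000  # entities fetched per task, as in the example in the thread

def run_batch(dataset, cursor):
    """Process one batch starting at `cursor`; return the cursor for the
    next batch, or None when the dataset is exhausted."""
    batch = dataset[cursor:cursor + BATCH]
    for entity in batch:
        pass  # ... update the entity ...
    next_cursor = cursor + len(batch)
    return next_cursor if next_cursor < len(dataset) else None

def nibble(dataset):
    """Serial chain: each 'task' must finish its query before the next
    one can even be created. Returns the number of batches run."""
    cursor, batches = 0, 0
    while cursor is not None:
        cursor = run_batch(dataset, cursor)
        batches += 1
    return batches
```

With the scatter approach, all the equivalents of `run_batch` could be enqueued up front and executed concurrently; here they are forced into a chain.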
* Actually the mapreduce lib oversamples to improve the results, so it's not quite that efficient.

> A task can simply iterate the data, fetching N entities at a time (in
> this case 1000), then launch a task that can update them (by itself or
> by splitting the work again). At most, if 10 minutes isn't enough to
> iterate over all the data, this "main" task can fetch only a piece of
> the data, start the update task, then restart itself on the next chunk
> of data...

Yes, for small batch runs it's not going to make a big difference, but the bigger the whole dataset to be iterated, the more any efficiency gains have an effect.

> Obviously, if it is possible to split the work using only key
> information (for example, keys are numbers from 1 to 400,000), this
> works better, because fetching data by key is always better than
> querying for the same data...

That's what __scatter__ does allow :)
