I can't suggest much without seeing the actual code being used in the task
to do the updating.
At the very least, it seems like you are doing double the work necessary
(based on how you described the process.. again, seeing source code would
eliminate the guesswork and make everything clear).
From what I understand, when a contact is updated.. a new revision of the
contact is created.. AND then the "old" contact is moved to some other
revision container or something like that.
Seems like you could do just this instead (though, you may already be doing
this):
from google.appengine.ext import db

class metaContact(db.Model):
    # Points to the current revision of a contact.
    ContactPosition = db.IntegerProperty()
    ContactKeyName = db.StringProperty()
    ContactRevision = db.StringProperty()

class Contact(db.Model):
    # One entity per revision of a contact.
    Revision = db.StringProperty()
    KeyName = db.StringProperty()
    # Other properties
The ultimate key_name for the Contact model would just be KeyName + '_' +
Revision
Zero pad Revision and KeyName so they can be lexically sorted.
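e.g. something like this (just a sketch of what I mean.. the helper name and
the 10-digit padding width are arbitrary choices of mine):

def contact_key_name(key_name, revision):
    # Zero-pad the revision so '0000000010' sorts after '0000000002'
    # when the key_names are compared lexically.
    return '%s_%010d' % (key_name, revision)

# contact_key_name('contact42', 3) -> 'contact42_0000000003'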
When a contact is updated, a new Contact entity is created with all the
updated (and old) properties, with the revision property increased by 1.
Then you just update the metaContact entity to point to the new
ContactRevision value.
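Roughly, something like this (an untested sketch.. the function name is
mine, and I'm assuming the 10-digit zero-padded revision strings from
above):

def update_contact(meta, **changed_fields):
    # Load the current revision of the contact.
    old_key = meta.ContactKeyName + '_' + meta.ContactRevision
    current = Contact.get_by_key_name(old_key)

    # Copy the old properties, apply the changes, bump the revision.
    new_revision = '%010d' % (int(meta.ContactRevision) + 1)
    props = dict((p, getattr(current, p)) for p in current.properties())
    props.update(changed_fields)
    props['Revision'] = new_revision
    props['KeyName'] = meta.ContactKeyName

    # Deterministic key_name, so re-running this on a task retry just
    # overwrites the same entity (idempotent).
    new_key = meta.ContactKeyName + '_' + new_revision
    Contact(key_name=new_key, **props).put()

    # Point the metaContact at the new revision.
    meta.ContactRevision = new_revision
    meta.put()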
If they want to see all Revisions of a Contact, you just fetch all Contact
entities for the given ContactKeyName.
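e.g. (using the property names above.. note that an equality filter plus a
sort on another property needs a composite index):

# All revisions of one contact, oldest first (the zero-padding makes the
# string sort match the numeric order).
revisions = (Contact.all()
             .filter('KeyName =', key_name)
             .order('Revision')
             .fetch(1000))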
Again, I avoid transactions at all costs... so I'm suggesting you do the
update without a transaction. Now, if you really wanted to avoid the off
chance of a Contact entity getting inserted but some error preventing the
metaContact's ContactRevision property from being updated... you could
still do that in a transaction. But, you're adding a lot of overhead.
In the end, you could never update all 50,000 contacts in one big
transaction.. so you would always be presented with the spectre (however
unlikely) of a certain percentage of the Contacts getting updated with the
correct revision.. but the remaining ones might not get updated (due to some
error).. so you would still need to have some sort of rollback or
reconciliation process to handle that.
But again, that error state would only come up if the task for updating
Contact and metaContact had a permanent failure (since the task would just
blindly retry the updates.. and the Contact, metaContact updates would be
idempotent so you wouldn't have multiple versions of a contact getting
created on task retries.)
Really, since you are using contact revisions.. it seems like avoiding
transactions is the way to go.. you don't risk losing historical contact
data since old revisions are not deleted (in the scenario I'm describing
here).. You're just creating new contact revisions on each update.
So, you could still use your sharded counter method to show a progress bar
to the user.. and presuming the tasks for updating each contact (or,
ideally, each batch of contacts) eventually succeed.. this method would work
much faster.
Worst case scenario, you could have two people managing the same contact
group.. and they would each submit an update to the same contact revision..
and the last update would be the winner.
But, if that were a common issue, you could still have the update occur in a
transaction (by giving metaContact and Contact a parent-child
relationship).. where they submit the fields they want to update.. the
task grabs the current revision in a transaction, updates the changed
fields, and puts the new revision information.
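If you went that route, it might look roughly like this (a sketch only,
assuming each Contact revision is created as a child of its metaContact so
both entities live in the same entity group):

def update_contact_txn(meta_key_name, **changed_fields):
    def txn():
        meta = metaContact.get_by_key_name(meta_key_name)
        new_revision = '%010d' % (int(meta.ContactRevision) + 1)
        old = Contact.get_by_key_name(
            meta.ContactKeyName + '_' + meta.ContactRevision, parent=meta)

        props = dict((p, getattr(old, p)) for p in old.properties())
        props.update(changed_fields)
        props['Revision'] = new_revision

        # parent=meta keeps both puts in one entity group, so they
        # commit (or fail) together.
        Contact(parent=meta,
                key_name=meta.ContactKeyName + '_' + new_revision,
                **props).put()
        meta.ContactRevision = new_revision
        meta.put()

    db.run_in_transaction(txn)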
As for your question about keeping tasks under 1000ms, that is true and not
true. You want your tasks to run as fast as possible, but you also want to
batch gets and puts as much as possible. (This will save you a lot of CPU
time and wall clock time.)
So, whatever you do with your code, you should experiment a little with
doing these updates in different batch sizes.
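e.g. a task that handles a batch of contacts at once might boil down to
something like this (again, just a sketch using the models from above.. the
function name and how the batch is chosen are up to you):

def update_batch(meta_key_names, changed_fields):
    # One batch get for the metaContacts, one for the current Contact
    # revisions, and one batch put for everything that changed.
    metas = metaContact.get_by_key_name(meta_key_names)
    current_keys = [m.ContactKeyName + '_' + m.ContactRevision for m in metas]
    currents = Contact.get_by_key_name(current_keys)

    to_put = []
    for meta, current in zip(metas, currents):
        new_revision = '%010d' % (int(meta.ContactRevision) + 1)
        props = dict((p, getattr(current, p)) for p in current.properties())
        props.update(changed_fields)
        props['Revision'] = new_revision
        to_put.append(Contact(
            key_name=meta.ContactKeyName + '_' + new_revision, **props))
        meta.ContactRevision = new_revision
        to_put.append(meta)
    db.put(to_put)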
P.S. I had this chunk nestled in the middle of my message.. but moved it to
the end since it's just wild speculation about what one could do with the
Matcher and Channel APIs:
If you felt really pumped up, maybe you could cook up a way to use the
Channel API and the Matcher API to push the update status to the user's
browser.
In other words, maybe the browser could enter an "updating" state and
subscribe to an update Channel for that particular user (You would need to
do this before the actual update tasks were added to the queue).. with a
list of Contacts it's waiting to see have been updated to a certain
revision.. then you could have some Matcher code that watches for updates to
that user's contacts.. when an update comes in, it sends a message to the
user's Channel saying, "Contact with KeyName = 'Blah' was updated to
Revision = 'Bleh'".. and the browser would just wait until it received
notification of updates to all the contacts in the update list. Though,
this presumes that you can depend on Matcher and Channel to never drop a
message or miss a match.
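Just to sketch the Channel half of that idea (the client id and message
format here are made up, and I haven't actually tried wiring this to the
Matcher API):

from google.appengine.api import channel

user_id = 'some-user'  # hypothetical id for the user running the update

# Before queueing the update tasks, create a channel for this user and
# hand the token to the browser so it can listen.
token = channel.create_channel('update-status-' + user_id)

# Later, whenever a contact reaches its new revision, whatever code sees
# that (the Matcher result handler, in this speculation) pushes a message.
channel.send_message('update-status-' + user_id,
                     '{"KeyName": "Blah", "Revision": "Bleh"}')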
On Mon, Nov 29, 2010 at 4:03 PM, dflorey <[email protected]> wrote:
> Thanks Eli for your time helping me out.
> You can have a look at the app I'm working on:
> http://ucm.floreysoft.net
> It's a contact manager app that will keep track of all changes to
> create a revision history.
> If a user has about 50000 contacts and e.g. adds all contacts to a new
> group or changes some fields on all contacts, I have to create a new
> entity for the current version and shift the previous version to the
> revision history (revisions are children of the master record to be able
> to handle the operation as an atomic commit).
> So in fact I have to create two entities in a single transaction and
> on top of that I have to make some calls to the Google contact servers
> in order to sync changes.
> So what I did is to create a task for each individual command executed
> by the user and add the task to the task queue.
> In the worst case, if a user modifies all 50000 contacts I'll add 50000
> tasks to the task queue.
> I'm using sharded counters to see how many tasks have already been
> executed to display a progress bar to the user.
> I guess it would speed things up if I batched some operations per
> task?
> I've read somewhere that it is important to keep tasks under 1000ms -
> so is it recommended to perform as many operations in a single task as
> can be done in less than a second?
>
> Thanks again!
>
> Daniel
>
>
> On 29 Nov., 19:15, Eli Jones <[email protected]> wrote:
> > I don't use transactions so I can't help you there. But, it seems like
> > trying to do big batch puts (say, on 50 parent entities and 50 child
> > entities at the same time) in a transaction would introduce a lot of
> > overhead. Only way to know for sure is to test.
> >
> > Also, if you check the mapreduce page, it mentions that transactions are
> > on the roadmap (but not yet supported):
> >
> > http://code.google.com/p/appengine-mapreduce/
> >
> > I know that there are technical limits to how much datastore putting you
> > can do at once. But, I haven't hit that limit except when putting around
> > 330 entities per second for 10 minutes straight. The average size of each
> > entity was about 2KB, and the puts were only batches of 10 since there
> > was a lot of up front processing that needed to be done first. This was
> > done using my own method of fanning out tasks.
> >
> > You should be able to do 100,000 entities extremely fast (without
> > transactions). If you can design a modification/put function that can
> > update 100 entities in under 1 second and you used a queue that processed
> > at 30/s.. you could do all the modifications in 33 seconds.
> >
> > How quickly could you pre-process 10 entities for a transactional put?
> > Depending on the size of them.. you could do the 100,000 entities in 5
> > minutes at a rate of 30/s.. but I'm guessing about the transactional put
> > on 5 parent and 5 child entities happening in under 1 second. Either way,
> > to speed up your update... you want to do these puts in batches of more
> > than 1 parent/child pair.
> >
> > Also, what are you using sharded counters for? How many entities does one
> > task update right now? The fact that it can take 1 hour to do 100,000
> > entities suggests the process is extremely inefficient.
> >
> > Again, without seeing the code (or at least a pseudo-code outline) for
> > what you are doing.. there is no way to really help you figure out a
> > straightforward speedup of your process.
> >
> > On Mon, Nov 29, 2010 at 12:21 PM, dflorey <[email protected]> wrote:
> > > Thanks a lot for your valuable replies!
> > > I'll have to check out the current state of the mapreduce lib as I
> > > remember from Google IO that it does not support certain filters etc.
> > > Simple question though: What is the maximum number of updated entities
> > > per minute inside a transaction that you have seen in the real world?
> >
> > > On 29 Nov., 18:04, Eli Jones <[email protected]> wrote:
> > > > You mention that "tasks get rescheduled for some reason".. what is
> > > > the reason? Does this reason occur frequently?
> >
> > > > Also, there is no way to evaluate how fast you can perform your
> > > > modifications since you haven't shown the code that you are currently
> > > > using.
> >
> > > > There may be several simple tweaks to your existing code that could
> > > > make it much faster.
> >
> > > > On Mon, Nov 29, 2010 at 9:29 AM, dflorey <[email protected]> wrote:
> > > > > Thanks for your response. I thought that mapreduce also sits on top
> > > > > of the task queue, so will it actually give any speed improvement
> > > > > over my approach?
> > > > > I am seeing ~1500 tasks per minute getting executed. Will mapreduce
> > > > > give higher numbers?
> >
> > > > > Daniel
> >
> > > > > > On 29 Nov., 10:41, Peter Ondruska <[email protected]> wrote:
> > > > > > I would use mapreduce for GAE, see
> > > > > > http://code.google.com/p/appengine-mapreduce/.
> > > > > > It has been integrated with the latest SDK so there is no need to
> > > > > > download it; I use it with Python--just make sure to import
> > > > > > google.appengine.ext.mapreduce.
> >
> > > > > > On 29 Nov., 10:06, dflorey <[email protected]> wrote:
> >
> > > > > > > Hi,
> > > > > > > I'm looking for the most effective way to update 50000 entities
> > > > > > > + one of the child entities each.
> > > > > > > Right now I'm using a task per transaction to be able to modify
> > > > > > > the entity and the child entities inside a transaction to make
> > > > > > > the task idempotent.
> > > > > > > I'm using sharded counters to check when the operation is done.
> > > > > > > Everything works fine, but it takes very long (= minutes to
> > > > > > > hours) to perform the modifications.
> > > > > > > I'm getting no concurrent modification exceptions etc. at all,
> > > > > > > but tasks get rescheduled for some reason and wait for a long
> > > > > > > time before getting executed depending on the number of retries.
> >
> > > > > > > Is there a way to speed things up?
> > > > > > > I'm looking for a solution that will execute the update almost
> > > > > > > immediately :-)
> > > > > > > My tasks take less than 1000ms each and I can see ~30 instances
> > > > > > > in the dashboard.
> >
> > > > > > > Thanks for any ideas,
> >
> > > > > > > Daniel