On Thu, Mar 11, 2010 at 8:56 PM, John Patterson <jdpatter...@gmail.com> wrote:
>
> But for typesafe changes large or small Twig supports data migration in a
> much safer, more flexible way than Objectify.  Read on for details.

You are increasing my suspicion that you've never actually performed
schema migrations on big, rapidly changing datasets.

> Cool, the @AlsoLoad is quite a neat feature.  Although very limited to
> simple naming changes and nothing structural.  All this is based on a
> dangerous assumption that you can modify "live" data in place.  Hardly
> bullet proof.

Actually, @AlsoLoad (in conjunction with @LoadOnly and the @PrePersist
and @PostLoad lifecycle callbacks) gives you enormous flexibility to
transform your data.  I know; I've had to do more of it than I would
like to admit.  You can rename fields, change types,
arbitrarily munge data, split entities into multiple parts, combine
multiple entities into one, convert between child entities and
embedded parts, etc.
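
To make that concrete, here is a small sketch of the pattern with a
made-up Person entity.  It assumes the Objectify 1.x annotations
discussed above; the field and property names are purely illustrative:

```java
import javax.persistence.Id;
import com.googlecode.objectify.annotation.AlsoLoad;

public class Person {
    @Id Long id;

    // Simple rename: old entities were saved under "fullname";
    // @AlsoLoad lets new code load either property name into this field.
    @AlsoLoad("fullname") String name;

    int years;

    // Type change: old entities stored age as a String.  A method with
    // a single @AlsoLoad-annotated parameter is invoked on load with
    // the old property's value, so we can munge it into the new field.
    void importAge(@AlsoLoad("age") String age) {
        this.years = Integer.parseInt(age);
    }
}
```

On save, only the new fields are written, so entities converge on the
new schema as they churn through the app.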

In most cases you can do this on a live running system.  That is the
entire point, actually - our goal is zero downtime for schema
migration.  The general approach:

 * Modify your entities to save in your new format.
 * Use Objectify's primitives so that data loads in both the old
format and the new format.
 * Test your code against your local datastore, or if you're deeply
concerned, against exported data in another appid.
 * Deploy your new code, letting the natural churn update your database.
 * Fire off a batch job at your leisure to finish it off.
 * Remove the extra loading logic from your code when you're done.
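
The final batch step can be a simple task-queue job that pages through
the kind and re-puts each chunk, so every remaining entity gets
re-saved in the new format.  A rough sketch against the Objectify 1.x
API (cursor and task-queue plumbing elided; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import com.googlecode.objectify.Objectify;
import com.googlecode.objectify.ObjectifyService;

public class MigratePersons {
    // Load a chunk and immediately re-put it.  Loading runs the
    // @AlsoLoad/@PostLoad conversions; putting re-saves the entities
    // in the new format.  One batch put is a single round trip.
    public static void migrateChunk(int chunkSize) {
        Objectify ofy = ObjectifyService.begin();
        List<Person> chunk = new ArrayList<Person>();
        for (Person p : ofy.query(Person.class).limit(chunkSize)) {
            chunk.add(p);
        }
        ofy.put(chunk);
    }
}
```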

Not every migration works exactly the same way, but the tools are
there.  I know from experience that it works and works well.

> The Twig solution is to create a new version of the type (v2) and process
> your changes while leaving the live data completely isolated and safe.  Then
> after you have tested your changes you bump up the version number of your
> live app.

This is cumbersome and inelegant compared to Objectify's solution.
You require the developers to 1) create a parallel hierarchy of
classes and 2) create code (possibly scattered across the app) to
write out both formats.  You require a complete duplication of the
datastore kind - potentially billions of entities occupying hell only
knows how much space.  It could take *weeks* to do even minor schema
migrations this way.  And if you want to make another minor change
halfway through the process?  Start from scratch!  In the meantime,
your customers are wondering why the new feature isn't live yet.

Also... do you realize how slow and expensive deletes are in
appengine?  Duplicating the database is just not an option.  Not with
the Mobcast 2.0 dataset (not live yet, I should be able to talk about
it more freely in a month or two).  Certainly not with Scott's
dataset, which may end up caching a significant chunk of Flickr,
Picasa, and Facebook if it takes off.

> What is with your obsession with batch gets?  I understand they are central
> in Objectify because you are always loading keys.  As I said already - even
> though this is not as essential in Twig it will be added to a new load
> command.

Batch gets are *the* core feature of NoSQL databases, including the
GAE datastore.  Look at these graphs:

http://code.google.com/status/appengine/detail/datastore/2010/03/12#ae-trust-detail-datastore-get-latency
http://code.google.com/status/appengine/detail/datastore/2010/03/12#ae-trust-detail-datastore-query-latency

Notice that a get()'s average latency is 50ms and a query()'s average
latency is 500ms.  Last week the typical query was averaging
800-1000ms with frequent spikes into 1200ms or so.

Deep down in the fiber of its being, BigTable is a key-value store.
It is very very efficient at doing batch gets.  It wants to do batch
gets all day long.  Queries touch indexes maintained in separate
tablets, and by comparison their performance sucks.
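
That's why it pays to structure reads as key lists and fetch them in
batches.  The datastore caps how many keys a single call may carry, so
a tiny helper like this keeps each batch under the limit.  This is
plain Java; the cap value is whatever the caller passes in, not an
official platform number:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchUtil {
    // Split a list of keys into chunks so each batch get/put stays
    // under the datastore's per-call limit.
    public static <T> List<List<T>> chunk(List<T> keys, int maxPerBatch) {
        List<List<T>> chunks = new ArrayList<List<T>>();
        for (int i = 0; i < keys.size(); i += maxPerBatch) {
            chunks.add(new ArrayList<T>(
                keys.subList(i, Math.min(i + maxPerBatch, keys.size()))));
        }
        return chunks;
    }
}
```

Each chunk then goes to a single batch get() - a handful of ~50ms
round trips instead of hundreds of serial fetches.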

I'm by no means a BigTable expert, but I have a significant
professional interest in being able to read & write a lot of data.  I
could not implement (perhaps better said I couldn't scale) Mobcast
without batch gets and sets.

To be honest, I'm not wholly thrilled with the performance of batch
get/put operations on appengine either.  Cassandra folks are claiming
10k/s writes *per machine*.  Tokyo Tyrant folks are claiming 20k+/sec
writes.  Reads are even faster!  True, these systems are not as
full-featured as the appengine datastore... but we're talking at least
two full orders of magnitude difference!  Ouch.

Why am I obsessed with batch gets?  Because they're essential for
making an application perform.  They're why there is such a thing as a
NoSQL movement in the first place.

> Oops I didn't post the CookBook page in the end.  Rest assured it is a
> trivial addition and I'll update the docs.
> It is also often better to cache above the data layer - hardly the killer
> feature you claim.

If you have a read-heavy app (and most are), nothing gives you
bang-for-the-buck like adding one little annotation and pulling your
data out of memcache instead of the datastore.  Caching at higher
levels *might* save you some additional CPU cycles, but it's certainly
a lot more work.
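
For the curious, the one-annotation version looks roughly like this -
a hypothetical entity, assuming the Objectify @Cached annotation under
discussion:

```java
import javax.persistence.Id;
import com.googlecode.objectify.annotation.Cached;

// One annotation: gets by key are served from memcache when possible,
// falling back to the datastore on a miss.
@Cached
public class Photo {
    @Id Long id;
    String url;
}
```

Writes still go through to the datastore as usual; only reads get the
shortcut.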

Jeff

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine for Java" group.