Hi fellow developers, just a cautionary tale for the new members out
there and people building up large datasets.

We already know that the gap between the raw size of our data and the
total size reported by the datastore comes from the indexes and the
various voodoo the datastore does to keep our data safe. That gap
becomes very relevant when you try to migrate your data out of GAE or
simply delete it in bulk.
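
If you want to see the split for yourself, the built-in datastore
statistics entities report the billed totals; below is a minimal
sketch (old Python 2 db SDK, run from a handler or the remote_api
shell), and depending on the SDK version the same stats also break
the total down into entity bytes vs. index bytes:

from google.appengine.ext.db import stats

# overall billed size, as the billing page sees it
total = stats.GlobalStat.all().get()
if total:
    print 'datastore total: %d bytes in %d entities' % (
        total.bytes, total.count)

# per-kind breakdown, to see which kind carries the weight
for kind_stat in stats.KindStat.all():
    print '%s: %d bytes, %d entities' % (
        kind_stat.kind_name, kind_stat.bytes, kind_stat.count)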

I was storing about 500 GB of data, which translated into > 2 TB in
the datastore (x4...). After spending days reprocessing most of this
data to remove the unused indexes (which cost me flexibility in my
queries plus a few hundred dollars), it went down to 1.6 TB, still
costing me about $450 / month for storage alone. An important note:
a lot of this data comes from small individual entities (about 1
billion of them), generated by reports and the like. I don't deny
that I could have come up with a better design, and my latest
codebase stores the data more efficiently (aggregating it into
serialized Text or Blob properties), but I still have to make do with
the v1 data set sitting there.
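
For what it's worth, here is a rough sketch of what I mean by the v1
vs. v2 design, with made-up Report kinds (old Python db API, not my
actual models): every indexed property on the tiny v1 entities drags
its own index rows along, while the v2 version keeps only the two
properties I still query and packs the rest into an unindexed blob:

import json
import zlib

from google.appengine.ext import db


class ReportV1(db.Model):
    # v1 style: one tiny entity per data point. Every indexed property
    # below also gets its own built-in index rows, which is a big part
    # of where the x3 / x4 storage multiplier comes from.
    account = db.StringProperty()
    metric = db.StringProperty()
    value = db.IntegerProperty()
    day = db.DateProperty()


class ReportBatchV2(db.Model):
    # v2 style: one entity per (account, day), everything else packed
    # into a blob. indexed=False skips the built-in index rows, and
    # Blob / Text properties are never indexed at all.
    account = db.StringProperty()
    day = db.DateProperty()
    source = db.StringProperty(indexed=False)   # kept, but not queryable
    payload = db.BlobProperty()                 # zlib-compressed JSON


def pack_day(account, day, source, rows):
    """Aggregate one day's worth of small report rows into one entity."""
    blob = db.Blob(zlib.compress(json.dumps(rows)))
    return ReportBatchV2(account=account, day=day, source=source,
                         payload=blob)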

I started migrating the data out of GAE into a simple MySQL instance
running on EC2. After migration, the entire dataset weighs < 150 GB
in MySQL (including indexes), so I have no idea where the extra TB is
coming from. The migration process was a pain in the a** and took me
5 freaking weeks to complete. I tried the bulk export from Python,
which sucks because it only exports textual data and integers but
skips blobs and binary data (it seems they don't teach base64
encoding at Google...). So I resorted to the remote API, after a
quick email chat with Greg d'Alesandre and Ikai Lan which basically
concluded with "sorry, we cannot help, and the remote API is not a
solution". Cool, then what is? The remote API is damn slow and
expensive: I basically had to read the entities one by one, store the
extracted files somewhere, and process them on the fly, with backups
and failsafes everywhere, because the GAE remote API will just break
from time to time (mostly due to datastore exceptions). The
extraction job had to be restarted a couple of times because of
cursors getting screwed up. So reading 1 billion entities from the
datastore takes weeks and costs a lot of dough.
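
For anyone about to attempt the same thing, here is a rough sketch of
the kind of resumable loop I mean: remote API, query cursors
checkpointed to disk so the job can be restarted when it breaks. The
app id, kind name and file paths are placeholders, ReportV1 is just
an Expando stand-in, the auth prompt is the old ClientLogin style,
and the actual MySQL insert is left out:

import getpass
import os

from google.appengine.ext import db
from google.appengine.ext.remote_api import remote_api_stub


class ReportV1(db.Expando):
    """Stand-in so the old kind can be queried without its full model."""


def auth_func():
    return raw_input('Email: '), getpass.getpass('Password: ')

remote_api_stub.ConfigureRemoteApi(
    None, '/_ah/remote_api', auth_func, 'your-app-id.appspot.com')

BATCH = 200
CURSOR_FILE = 'cursor.txt'   # checkpoint so a crash doesn't restart from zero


def load_cursor():
    if os.path.exists(CURSOR_FILE):
        return open(CURSOR_FILE).read().strip() or None
    return None


def export_row(entity, out):
    # placeholder for the real MySQL insert: just dump key + properties
    props = dict((name, getattr(entity, name))
                 for name in entity.dynamic_properties())
    out.write('%s\t%r\n' % (entity.key(), props))


out = open('export.tsv', 'a')        # append so restarts keep earlier work
query = ReportV1.all()
cursor = load_cursor()
if cursor:
    query.with_cursor(cursor)

while True:
    batch = query.fetch(BATCH)
    if not batch:
        break
    for entity in batch:
        export_row(entity, out)
    out.flush()
    open(CURSOR_FILE, 'w').write(query.cursor())   # checkpoint this batch
    query.with_cursor(query.cursor())              # resume from here next fetch

The checkpoint file is the important part: when the remote API throws,
you kill the script, restart it, and it picks up from the last saved
cursor instead of from entity number one.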

But then comes the axe: your data is still sitting on GAE and you
have to delete it. With 1 billion entities in the datastore and a
x3 / x4 write amplification factor, it will cost you $2-3k just to
empty your own trash bin. I seriously don't mind paying for datastore
writes, but having to pay $2,000 to delete data that already costs me
$450 / month is seriously pushing it.
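
And for the record, the least painful delete loop I know of is still
just batches of keys, one fetch at a time; a sketch below (same
Expando stand-in trick, placeholder kind name), where every one of
those deletes is still billed as an entity write plus its index
cleanup, which is exactly where the bill comes from:

from google.appengine.ext import db


class ReportV1(db.Expando):
    """Stand-in for the kind being wiped."""


BATCH = 500


def delete_kind(model_class):
    deleted = 0
    while True:
        # keys_only: don't pay to read the entity bodies back
        keys = model_class.all(keys_only=True).fetch(BATCH)
        if not keys:
            break
        db.delete(keys)      # one RPC, but still N billed entity deletes
        deleted += len(keys)
        print 'deleted %d entities so far' % deleted


delete_kind(ReportV1)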

Every MySQL / NoSQL solution that I know of has some sort of flushing
mechanism that doesn't require deleting entries one by one. How come
the datastore doesn't? I am not paying the outrageous $500 / month
for support, but I'm paying far more in platform usage (I have an
open credit of $300 / day), and so far I haven't gotten any
satisfying answer or support from the GAE team. I love the platform,
but seriously, knowing what I know now, vendor lock-in has never rung
so true as with GAE, and I would not commit so much time and energy
to GAE for my big / serious projects again, leaving it instead to
small, quick and dirty jobs.

Please share and comment.

Cheers
