The cold-hearted bastard in me has the following thoughts.

You wrote code that treated the Datastore like SQL.
You didn't set "do not index" on the properties you didn't need to index.
You changed the structure of your data midway, but instead of flushing and
starting over you just changed it in place.
You likely aren't doing any cleanup.
You likely aren't using the right property types for your data.
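
To make the indexing point concrete, here is a toy model of how indexed
properties multiply writes. The constants are assumptions chosen for
illustration, not official Datastore billing rules, and `writes_per_put` is
a hypothetical helper. In the Python SDK the real switch is the
`indexed=False` argument on a property (for example
`db.StringProperty(indexed=False)`); the sketch only shows why flipping it
matters.

```python
# Toy model of index-driven write amplification.
# Both constants are ASSUMPTIONS for illustration, not official billing rules.
BASE_WRITES_PER_ENTITY = 2      # assumed fixed cost of writing the entity itself
WRITES_PER_INDEXED_VALUE = 2    # assumed: one ascending + one descending index row


def writes_per_put(properties, unindexed=()):
    """Estimate low-level writes needed to store one entity.

    properties: dict of property name -> value (a list counts once per item).
    unindexed:  names of properties excluded from indexing.
    """
    indexed_values = 0
    for name, value in properties.items():
        if name in unindexed:
            continue
        indexed_values += len(value) if isinstance(value, (list, tuple)) else 1
    return BASE_WRITES_PER_ENTITY + WRITES_PER_INDEXED_VALUE * indexed_values


# Hypothetical report entity: only user_id and day are ever queried.
report = {
    "user_id": 42,
    "day": "2011-12-27",
    "payload": "x" * 200,
    "tags": ["a", "b", "c"],
}

everything_indexed = writes_per_put(report)
only_queried_indexed = writes_per_put(report, unindexed=("payload", "tags"))
```

Under these assumed constants the same entity needs 14 writes with
everything indexed and 6 with only the queried properties indexed. Multiply
that difference by a billion entities and it shows up on the bill.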

So what I hear is "Whine, whine, whine, I built my stuff wrong, Google tried
to help me but I wanted to move to Amazon so they didn't have many
suggestions I liked, so now I'm sad, whine, whine, whine, woe is me. Please
tell others so I can get sympathy for not understanding the platform I was
working on."

Did I miss anything?



-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Yohan
Sent: Tuesday, December 27, 2011 5:44 PM
To: Google App Engine
Cc: [email protected]
Subject: [google-appengine] Cautionary Tale: Abusive price for data
migration and deletion

Hi fellow developers, just a cautionary tale for the new members out there
and people building up large datasets.

We already know that the difference in reported datastore size between the
actual data and the total size is due to the indexes and the various voodoo
the datastore does to keep our data safe. That overhead matters even more
when you are trying to migrate your data out of GAE or simply delete your
data in bulk.

I was storing about 500 GB of data, which translated into > 2 TB of data in
the datastore (x4...). After I spent days reprocessing most of this data to
remove the unused indexes (losing flexibility in my queries and costing me a
few hundred dollars), it went down to 1.6 TB, still costing me about
$450 / month for storage alone. An important note is that a lot of this data
comes from individual small entities (about 1 billion of them) generated by
reports and the like. I don't deny that I could have come up with a better
design, and my latest codebase stores the data in more efficient ways
(aggregating into serialized Text or Blobs), but I still have to make do
with the v1 data set sitting there.
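
The "aggregating into serialized Text or Blobs" idea can be sketched
roughly as below. This is a minimal illustration, not the actual codebase
described above: the JSON + zlib encoding, the `pack_reports` /
`unpack_reports` names and the sample data are all assumptions.

```python
import json
import zlib


def pack_reports(reports):
    """Serialize a list of small report dicts into one compressed blob."""
    raw = json.dumps(reports, separators=(",", ":"), sort_keys=True)
    return zlib.compress(raw.encode("utf-8"))


def unpack_reports(blob):
    """Inverse of pack_reports: recover the list of report dicts."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical data: one user's hourly reports for a single day.
reports = [{"hour": h, "views": h * 3, "clicks": h % 5} for h in range(24)]

# One blob (stored as a single unindexed Blob/Text property) replaces
# 24 separate small entities, each with its own key and index rows.
blob = pack_reports(reports)
restored = unpack_reports(blob)
```

The trade-off is that the individual reports can no longer be queried by
the datastore; they have to be fetched and unpacked as a group.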

I started a migration of the data out of GAE into a simple MySQL instance
running on EC2. In reality, after migration, the entire dataset only weighs
< 150 GB (including indexes) in MySQL, so I have no idea where the extra TB
is coming from. The migration process was a pain in the a** and took me 5
freaking weeks to complete. I tried the bulk export from Python, which sucks
because it only exports textual data and integers but skips blobs and binary
data (it seems they don't learn base64 encoding at Google...). So I
resorted to the remote API after a quick email chat with Greg d'Alesandre
and Ikai Lan, which basically concluded with "sorry, cannot help, and remote
API is not a solution". Cool, then what is? The remote API is damn slow and
expensive: I basically had to read the entities one by one, store the
extracted file somewhere and process it on the fly, with backups and
failsafes everywhere because the GAE remote API will just break from time to
time (due to datastore exceptions mostly). The extraction job had to be
restarted a couple of times because of cursors being screwed up. So reading
1 billion entities from the datastore takes weeks and costs a lot of dough.
But then comes the axe: your data is still sitting on GAE and you have to
delete it. With 1 billion entries in the datastore and a x3 / x4 writing
factor, it will cost you $2-3k to empty your dustbin. I seriously don't mind
paying for datastore writes, but having to pay $2000 to delete data that
already costs me $450 / month is seriously pushing it.
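
The restart problem described above (an extraction job that dies on a
datastore exception and has to pick up where it left off) can be sketched
as a loop that saves its cursor after every page. This is a stand-alone
illustration, not remote API code: `fetch_page` is a hypothetical stand-in
for the remote read, the "datastore" is just `range(10)`, and the file
names are made up.

```python
import json
import os
import tempfile


def fetch_page(cursor, page_size):
    """Hypothetical stand-in for a remote read: returns (entities, next_cursor).

    The 'datastore' is range(10) and the cursor is a plain offset;
    next_cursor is None once the data is exhausted.
    """
    data = list(range(10))
    start = cursor or 0
    page = data[start:start + page_size]
    end = start + len(page)
    return page, (end if end < len(data) else None)


def export(out_path, checkpoint_path, page_size=3, fail_after_pages=None):
    """Append entities to out_path, saving the cursor after every page."""
    cursor = None
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        if state["done"]:
            return                      # a previous run already finished
        cursor = state["cursor"]        # resume where the last run stopped
    pages = 0
    while True:
        if fail_after_pages is not None and pages == fail_after_pages:
            raise RuntimeError("simulated datastore exception")
        entities, next_cursor = fetch_page(cursor, page_size)
        with open(out_path, "a") as out:
            for entity in entities:
                out.write(json.dumps(entity) + "\n")
        with open(checkpoint_path, "w") as f:
            json.dump({"cursor": next_cursor, "done": next_cursor is None}, f)
        pages += 1
        if next_cursor is None:
            return
        cursor = next_cursor


workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "entities.jsonl")
ckpt_path = os.path.join(workdir, "checkpoint.json")

try:
    export(out_path, ckpt_path, fail_after_pages=2)   # dies after two pages
except RuntimeError:
    pass
export(out_path, ckpt_path)                           # resumes from saved cursor

with open(out_path) as f:
    exported = [json.loads(line) for line in f]
```

Note that the output write and the checkpoint write are not atomic here: a
crash between the two would re-export one page on restart, so a real job
would also need idempotent inserts or de-duplication on the MySQL side.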

Every MySQL / NoSQL solution that I know of has some sort of flushing
mechanism that doesn't require deleting each entry one by one. How come the
datastore doesn't? I am not paying the outrageous $500 / month for support,
but I'm paying far more in platform usage (I have an open credit of $300 /
day) and so far I haven't gotten any satisfying answer or support from the
GAE team. I love the platform, but seriously, knowing what I know now, vendor
lock-in has never rung so true as with GAE, and I would not commit so much
time and energy to GAE for my big / serious projects; I would just leave it
for small quick-and-dirty jobs.
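
For perspective, the figures quoted earlier in this message can be turned
into a few ratios. Everything below is plain arithmetic on the numbers
already given; the only assumptions are decimal units (1 TB = 1000 GB) and
treating "> 2 TB" and "< 150 GB" as exactly 2000 GB and 150 GB, which makes
both overhead ratios lower bounds.

```python
ENTITIES = 10 ** 9                  # "about 1 billion" entities
DELETE_COST_USD = (2000, 3000)      # "$2-3k" quoted for deleting everything
STORAGE_USD_PER_MONTH = 450         # "$450 / month for storage alone"

RAW_GB = 500                        # data as the author counts it
STORED_GB = 2000                    # "> 2 TB" in the datastore (lower bound)
CLEANED_GB = 1600                   # after removing unused indexes
MYSQL_GB = 150                      # "< 150 GB" in MySQL (upper bound)

# Implied price of deleting one million entities, in dollars.
cost_per_million_deletes = [c / (ENTITIES / 10 ** 6) for c in DELETE_COST_USD]

# How many months of storage the one-off deletion bill is equivalent to.
months_of_storage = [round(c / STORAGE_USD_PER_MONTH, 1) for c in DELETE_COST_USD]

# Storage blow-up factors.
overhead_vs_raw = STORED_GB / RAW_GB                    # the "x4" quoted above
overhead_vs_mysql = round(CLEANED_GB / MYSQL_GB, 1)     # datastore vs MySQL
```

In other words, deleting the data costs roughly as much as keeping it
around for another four to seven months.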

Please share and comment.

Cheers

--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en.

