The cold-hearted bastard in me has the following thoughts. You wrote code that treated the Datastore like SQL. You didn't set "do not index" on the properties you didn't need to index. You changed the structure of your data midway, but instead of flushing and starting over you just changed it. Likely you aren't doing any cleanup. Likely you aren't using the right types for your data.
So what I hear is "Whine, whine, whine, I built my stuff wrong, Google tried to help me but I wanted to move to Amazon so they didn't have many suggestions I liked, so now I'm sad, whine, whine, whine, woe is me. Please tell others so I can get sympathy for not understanding the platform I was working on."

Did I miss anything?

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Yohan
Sent: Tuesday, December 27, 2011 5:44 PM
To: Google App Engine
Cc: [email protected]
Subject: [google-appengine] Cautionary Tale: Abusive price for data migration and deletion

Hi fellow developers,

Just a cautionary tale for the new members out there and for people building up large datasets.

We already know that the difference in reported datastore size between the actual data and the total size is due to the indexes and the various voodoo the datastore does to keep our data safe. This matters even more when you are trying to migrate your data out of GAE or simply delete your data in bulk.

I was storing about 500 GB of data, which translated into > 2 TB of data in the datastore (x4...). After I spent days reprocessing most of this data to remove the unused indexes (losing flexibility in my queries and spending a few hundred dollars in the process), it went down to 1.6 TB, still costing me about $450 / month for storage alone.

An important note is that a lot of this data comes from individual small entities (about 1 billion of them), coming from reports and such. I don't deny that I could have come up with a better design, and my latest codebase stores the data in more efficient ways (aggregating into serialized Text or Blobs), but I still have to make do with the v1 data set sitting there.

I started a migration of the data out of GAE into a simple MySQL instance running on EC2. In reality, after migration, the entire dataset weighs < 150 GB (including indexes) in MySQL, so I have no idea where the extra TB is coming from.
The migration process was a pain in the a** and took me 5 freaking weeks to complete. I tried the bulk export from Python, which sucks because it only exports textual data and integers but skips blobs and binary data (it seems they don't learn Base64 encoding at Google...). So I resorted to the remote API after a quick email chat with Greg d'Alesandre and Ikai Lan, which basically concluded with "sorry cannot help and remote api is not a solution". Cool, then what is?

The remote API is damn slow and expensive: I basically had to read the entities one by one, store the extracted file somewhere and process it on the fly, with backups and failsafes everywhere, because the GAE remote API will just break from time to time (mostly due to datastore exceptions). The extraction job had to be restarted a couple of times because of cursors being screwed up. So reading 1 billion entities from the datastore takes weeks and costs a lot of dough.

But then comes the axe: your data is still sitting on GAE and you have to delete it. With 1 billion entries in the datastore and a x3 / x4 write factor, it will cost you $2-3k to empty your dustbin. I seriously don't mind paying for datastore writes, but having to pay $2000 to delete data that already costs me $450 / month is seriously pushing it. Every MySQL / NoSQL solution that I know of has some sort of flushing mechanism that doesn't require deleting each entry one by one. How come the datastore doesn't?

I am not paying the outrageous $500 / month for support, but I'm paying far more in platform usage (I have an open credit of $300 / day), and so far I haven't gotten any satisfying answer or support from the GAE team.

I love the platform, but seriously, knowing what I know now, vendor lock-in has never rung so true as with GAE, and I would not commit so much time and energy to GAE for my big / serious projects, leaving it only for small quick-and-dirty jobs.

Please share and comment.
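The deletion estimate above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an illustration: the price of $0.10 per 100,000 write operations is an assumption that does not appear in this thread, and `bulk_delete_cost` is a hypothetical helper, not part of any GAE API. Only the entity count (1 billion) comes from the message.

```python
from decimal import Decimal

# ASSUMPTION: list price per 100,000 datastore write operations, in USD.
# This figure is not stated in the thread; substitute the rate you are billed.
ASSUMED_PRICE_PER_100K_WRITES = Decimal("0.10")


def bulk_delete_cost(entity_count, write_ops_per_delete,
                     price_per_100k=ASSUMED_PRICE_PER_100K_WRITES):
    """Return the estimated USD cost of deleting entities one by one.

    write_ops_per_delete is the "write factor": how many billed write
    operations a single entity delete triggers (entity plus index rows).
    """
    total_ops = Decimal(entity_count) * Decimal(write_ops_per_delete)
    return total_ops / Decimal(100_000) * price_per_100k


if __name__ == "__main__":
    entities = 10**9  # "1 billion entries", from the message above
    for factor in (2, 3, 4):
        cost = bulk_delete_cost(entities, factor)
        print(f"{factor} write ops per delete: about ${cost:,.0f}")
```

Under the assumed rate, two to three write operations per delete land in the $2,000 to $3,000 range quoted above, and every additional index row touched per entity adds another $1,000 per billion entities.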
Cheers

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
