Wow! Thanks very much to everyone who posted suggestions and to those who sent me direct replies.
First, please let me say that I believe it's in Google's best interest and my company's for us to keep core portions of our application in AppEngine. Furthermore, given the amount of investment we have in this infrastructure, I intend to pursue any avenues we have to continue using AppEngine. I believe that with Google's help we can find an engineering solution and/or pricing model that allows for both the platform and its customers to be successful.

OK. That said, let me summarize and respond to some of the concerns/recommendations above:

[NickolasD] Is this something you could move into Google Cloud SQL?
Yes, but it's not clear what the pricing model will be for Cloud SQL and whether it will be any cheaper than AppEngine.

[RichardW] Maybe the GAE team should borrow the idea of spot prices from Amazon.
Love this idea. It would serve to spread out resource usage, provide market pricing, and benefit all involved.

[RichardW] Maybe run your mapreduce on smaller sets of the data to spread it out over multiple days and avoid adding too many instances?
As detailed above, the costly component here is the database operations charge with respect to large datasets and indexed properties.

[sb] Google Cloud SQL looks interesting. but 30 days is not enough notice to respond to changes/decisions that may be made.
Totally agree. I get 45 days' notice on my rent increases and it takes far less effort for me to change apartments.

[de Witte] What if you disable the app for maintenance, doing the following steps...
Really interesting suggestion! Would love to hear if someone's tried this.
a) We'd really like to avoid turning off the app for the 1-2 days it would take to create the indexes.
b) I'm not certain a rebuild would be any cheaper. If it is, that's probably an unintentional pricing discrepancy that I'd prefer not to rely on.

[JonS] For our application, we used Geohashing.
We used geohashing before geoboxing, but it didn't work for us:
a) AppEngine requires that a query have at most one inequality comparison, and geohashing uses it.
b) We found that geohashing queries were much slower than geoboxing for the same parameters, adding human-noticeable delay (>400ms).

[VivekP] I have a table with 1.5TB of data. It costs me tens of thousands (one-time) to delete it and a few thousand (per year) to keep it.

[Andrin] I use a version of geohashing which only uses the most precise value.
Our geohash does the same thing, but the above limitations still exist.

[RichardW] What if you had the gps data as children of each entry and then used a keys-only query to match?
Love the suggestions! We thought about doing something like this as well, but we'd still have one entity per StringListProperty. That's not so bad, but we'd also need to copy down other properties so we could restrict the query based on other values on the entity. For example: finding entities within a certain distance sorted by popularity or filtered by user. I'm not certain there would be cost benefits, but I am certain it would add substantial complexity to the data + app.

[IkaiL] please describe the engineering details and business purpose
Provided in an earlier post. Do you need any more info? Any thoughts?

We tried registering for Premier support a few days ago but haven't heard back yet.

Thanks again,
-Corey

On Jan 6, 8:50 am, George <[email protected]> wrote:
> Corey,
>
> Did you guys consider something along the lines of SimpleGeo to
> outsource your spatial stuff?
>
> Is there a political or philosophical reason to keep everything inside
> of GAE?
>
> -- George
>
> On Jan 5, 3:24 pm, "Corey [Firespotter]" <[email protected]>
> wrote:
>
> > I work with Petey on this and can help clarify some of the details.
> >
> > The Entities:
> > We have a lot of entities (~14mil), each of which has a
> > StringListProperty called "geoboxes". Like so:
> > class Place(search.SearchableModel):
> >   name = db.StringProperty()
> >   ...
> >   # Location specific fields.
> >   coordinates = db.GeoPtProperty(default=None)
> >   geohash = db.StringProperty()
> >   geoboxes = db.StringListProperty()
> >
> > Background (details on geoboxing at bottom):
> > We're running a mapreduce to change the geobox sizes/precision for a
> > large number of entities. These entities currently have a 'geoboxes'
> > StringListProperty with ~20 strings. For example:
> > geoboxes = [u'37.341|-121.894|37.339|-121.892',
> >             u'37.341|-121.892|37.339|-121.891', ...]
> > We are changing those 20 strings to 20 new strings. Example:
> > geoboxes = [u'37.3411|-121.8940|37.3395|-121.8926',
> >             u'37.3411|-121.8929|37.3395|-121.8916', ...]
> >
> > The Cost:
> > We did almost this same mapreduce when we first added the geoboxes
> > back in July. In that case we were populating the list for the first
> > time so we can assume half as many operations were required (no
> > removing of old values). Total cost in July was ~$160 for the CPU
> > time.
> >
> > When we ran the mapreduce again this week to change the box sizes the
> > cost was $18 for Frontend Instance Hours, $15 for Datastore Reads
> > (21mil) and $2,500 for Datastore Writes (2500mil). This was not a
> > complete run of the mapreduce. We aborted it after 5.4mil (38%) of
> > the entities were updated. Hence Petey's estimate that the full
> > update would cost $6,500.
> >
> > The Operations:
> > Each entity update is removing ~20 existing strings from the geoboxes
> > StringList and adding 20 more. The geobox property is indexed (and
> > has to be) and is involved in 3 composite indexes, so as best I
> > understand it this means each string change results in 10 writes (4 +
> > 2 * 3). So for every entity whose geoboxes we update, we perform 401
> > write operations (1 + 10 * 40).
> >
> > This agrees pretty well with the charges (2,500,000,000 ops /
> > 5,424,000 entities) = 460 ops per entity.
> >
> > That's a lot of writes and likely the core of the surprising cost.
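
For anyone who wants to re-run the arithmetic quoted above, here is a rough sketch in Python. It uses only figures from this thread: the "4 + 2 per composite index" write count is Corey's own estimate, and the $0.10 per 100k operations price is the one Ikai quotes further down. The function names are made up for illustration, and this is not an official billing model.

```python
# Rough write-operation accounting, using only figures quoted in this thread.
# The formulas mirror Corey's estimate; they are not an official billing model.

PRICE_PER_OP = 0.10 / 100000  # dollars per datastore op ($0.10 per 100k ops)


def writes_per_string_change(composite_indexes):
    """Corey's estimate: 4 writes plus 2 per composite index."""
    return 4 + 2 * composite_indexes


def writes_per_entity(strings_changed, composite_indexes):
    """1 write for the entity itself, plus index writes per changed string."""
    return 1 + writes_per_string_change(composite_indexes) * strings_changed


def cost_in_dollars(write_ops):
    """Convert a number of write operations into dollars."""
    return write_ops * PRICE_PER_OP


# 20 strings removed + 20 strings added, 3 composite indexes.
estimated_per_entity = writes_per_entity(strings_changed=40, composite_indexes=3)

# Figures from the aborted run: 2,500mil writes over 5.424mil entities.
observed_per_entity = 2500000000 / 5424000

print("estimated writes per entity:", estimated_per_entity)
print("observed writes per entity: %.0f" % observed_per_entity)
print("cost of partial run: $%.0f" % cost_in_dollars(2500000000))
print("projected full run (14mil entities): $%.0f"
      % cost_in_dollars(14000000 * observed_per_entity))
```

Scaling the observed per-entity rate to all 14mil entities lands close to Petey's $6,500 estimate, which suggests the bill really is dominated by index writes rather than instance hours.
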
> > However, I'm not sure how we could avoid that with App Engine (open to
> > ideas!), and since we could pay for dedicated servers for that amount,
> > I think the pricing is probably off as well.
> >
> > Even if we treat the geobox update as a one-time cost, we have other
> > properties like scores, labels, etc. that require occasional tweaking.
> > Updating even a single indexed property across all these entities
> > costs us $60-$100 and typically many times that in practice because
> > these interesting fields tend to be used in composite indexes.
> >
> > -Corey
> >
> > Geoboxing Details
> > Geoboxing is a technique used to search for entities near a point on
> > the earth in a database that can only perform equality queries (like
> > App Engine). In short, you break up the world into boxes and record
> > which box each entity belongs to as well as any nearby boxes. Then
> > you break up the world into larger boxes and repeat until you have a
> > good range of sizes covered.
> > There's a good article on the logic of the algorithm here:
> > http://code.google.com/appengine/articles/geosearch.html
> >
> > On Jan 5, 11:58 am, "Ikai Lan (Google)" <[email protected]> wrote:
> > > Brian (apologies if that is not your name),
> > >
> > > How much of the costs are instance hours versus datastore writes? There's
> > > probably something going on here. The largest costs are to update indexes,
> > > not entities. Assuming $6500 is the cost of datastore writes alone, that
> > > breaks down to:
> > >
> > > ~$0.0004 a write
> > >
> > > Pricing is $0.10 per 100k operations, so that means using this equation:
> > >
> > > (6500.00 / 14000000) / (0.10 / 100000)
> > >
> > > You're doing about 464 write operations per put, which roughly translates
> > > to 6.5 billion writes.
> > >
> > > I'm trying to extrapolate what you are doing, and it sounds like you are
> > > doing full text indexing or something similar ... and having to update all
> > > the indexes.
> > > When you update a property, it takes a certain amount of
> > > writes. Assuming you are changing String properties, each property you
> > > update takes this many writes:
> > >
> > > - 2 indexes deleted (ascending and descending)
> > > - 2 indexes updated (ascending and descending)
> > >
> > > So if you were only updating all the list properties, that means you are
> > > updating 100 list properties.
> > >
> > > Given that this is a regular thing you need to do, perhaps there is an
> > > engineering solution for what you are trying to do that will be more cost
> > > effective. Can you describe why you're running this job? What features
> > > does this support in your product?
> > >
> > > --
> > > Ikai Lan
> > > Developer Programs Engineer, Google App Engine
> > > plus.ikailan.com | twitter.com/ikai
> > >
> > > On Thu, Jan 5, 2012 at 10:08 AM, Petey <[email protected]> wrote:
> > > > In this one case we had to change all of the items in the
> > > > listproperty. In our most common case we might have to add and delete
> > > > a couple items to the list property every once in a while. That would
> > > > still cost us well over $1,000 each time.
> > > >
> > > > Most of the reasons for this type of data in our product are to
> > > > compensate for the fact that there isn't full text search yet. I know
> > > > they are beta testing full text, but I'm still worried that that also
> > > > might be too expensive per write.
> > > >
> > > > On Jan 5, 6:54 am, Richard Watson <[email protected]> wrote:
> > > > > A couple thoughts.
> > > > >
> > > > > Maybe the GAE team should borrow the idea of spot prices from Amazon.
> > > > > That's a great way to have lower-priority jobs that can run when there
> > > > > are instances available. We set the price we're willing to pay; if the
> > > > > spot cost drops below that, we get the resources. It creates a market
> > > > > where more urgent jobs get done sooner and Google makes better use of
> > > > > quiet periods.
> > > > >
> > > > > On your issue:
> > > > > Do you need to update every entity when you do this? How many items on
> > > > > the listproperty need to be changed? Could you tell us a bit more of
> > > > > what the data looks like?
> > > > >
> > > > > I'm thinking that 14 million entities x 18 items each is the amount of
> > > > > entries you really have, each distributed across at least 3 servers
> > > > > and then indexed. That seems like a lot of writes if you're re-writing
> > > > > everything. It's likely a bad idea to rely on an infrastructure
> > > > > change to fix this (recurring) issue, but there is hopefully a way to
> > > > > reduce the amount of writes you have to do.
> > > > >
> > > > > Also, could you maybe run your mapreduce on smaller sets of the data
> > > > > to spread it out over multiple days and avoid adding too many
> > > > > instances? Has anyone done anything like this?
> > > >
> > > > --
> > > > You received this message because you are subscribed to the Google
> > > > Groups "Google App Engine" group.
> > > > To post to this group, send email to [email protected].
> > > > To unsubscribe from this group, send email to
> > > > [email protected].
> > > > For more options, visit this group at
> > > > http://groups.google.com/group/google-appengine?hl=en.
