Corey,

Did you guys consider something along the lines of SimpleGeo to
outsource your spatial stuff? Is there a political or philosophical
reason to keep everything inside of GAE?

-- George

On Jan 5, 3:24 pm, "Corey [Firespotter]" <[email protected]> wrote:
> I work with Petey on this and can help clarify some of the details.
>
> The Entities:
> We have a lot of entities (~14 million), each of which has a
> StringListProperty called "geoboxes". Like so:
>
> from google.appengine.ext import db, search
>
> class Place(search.SearchableModel):
>     name = db.StringProperty()
>     ...
>     # Location specific fields.
>     coordinates = db.GeoPtProperty(default=None)
>     geohash = db.StringProperty()
>     geoboxes = db.StringListProperty()
>
> Background (details on geoboxing at bottom):
> We're running a mapreduce to change the geobox sizes/precision for a
> large number of entities. These entities currently have a 'geoboxes'
> StringListProperty with ~20 strings. For example:
>     geoboxes = [u'37.341|-121.894|37.339|-121.892',
>                 u'37.341|-121.892|37.339|-121.891', ...]
> We are changing those 20 strings to 20 new strings. Example:
>     geoboxes = [u'37.3411|-121.8940|37.3395|-121.8926',
>                 u'37.3411|-121.8929|37.3395|-121.8916', ...]
>
> The Cost:
> We did almost this same mapreduce when we first added the geoboxes
> back in July. In that case we were populating the list for the first
> time, so we can assume half as many operations were required (no
> removing of old values). Total cost in July was ~$160 for the CPU
> time.
>
> When we ran the mapreduce again this week to change the box sizes the
> cost was $18 for Frontend Instance Hours, $15 for Datastore Reads
> (21 million) and $2,500 for Datastore Writes (2,500 million). This
> was not a complete run of the mapreduce. We aborted it after 5.4
> million (38%) of the entities were updated. Hence Petey's estimate
> that the full update would cost $6,500.
>
> The Operations:
> Each entity update is removing ~20 existing strings from the geoboxes
> StringList and adding 20 more.
> The geobox property is indexed (and
> has to be) and is involved in 3 composite indexes, so as best I
> understand it this means each string change results in 10 writes (4 +
> 2 * 3). So on every entity where we update the geoboxes we perform 401
> write operations (1 + 10 * 40).
>
> This agrees pretty well with the charges (2,500,000,000 ops /
> 5,424,000 entities) = ~460 ops per entity.
>
> That's a lot of writes and likely the core of the surprising cost.
> However, I'm not sure how we could avoid that with App Engine (open to
> ideas!), and since we could pay for dedicated servers for that amount,
> I think the pricing is probably off as well.
>
> Even if we treat the geobox update as a one-time cost, we have other
> properties like scores, labels, etc. that require occasional tweaking.
> Updating even a single indexed property across all these entities
> costs us $60-$100, and typically many times that in practice because
> these interesting fields tend to be used in composite indexes.
>
> -Corey
>
> Geoboxing Details
> Geoboxing is a technique used to search for entities near a point on
> the earth in a database that can only perform equality queries (like
> App Engine). In short, you break up the world into boxes and record
> which box each entity belongs to as well as any nearby boxes. Then
> you break up the world into larger boxes and repeat until you have a
> good range of sizes covered.
> There's a good article on the logic of the algorithm here:
> http://code.google.com/appengine/articles/geosearch.html
>
> On Jan 5, 11:58 am, "Ikai Lan (Google)" <[email protected]> wrote:
> > Brian (apologies if that is not your name),
> >
> > How much of the costs are instance hours versus datastore writes? There's
> > probably something going on here. The largest costs are to update indexes,
> > not entities.
> > Assuming $6500 is the cost of datastore writes alone, that
> > breaks down to:
> >
> > ~$0.00046 per put
> >
> > Pricing is $0.10 per 100k operations, so that means using this equation:
> >
> > (6500.00 / 14000000) / (0.10 / 100000)
> >
> > You're doing about 464 write operations per put, which roughly translates
> > to 6.5 billion writes.
> >
> > I'm trying to extrapolate what you are doing, and it sounds like you are
> > doing full text indexing or something similar ... and having to update all
> > the indexes. When you update a property, it takes a certain amount of
> > writes. Assuming you are changing String properties, each property you
> > update takes this many writes:
> >
> > - 2 index entries deleted (ascending and descending)
> > - 2 index entries written (ascending and descending)
> >
> > So if you were only updating list properties, that means you are
> > updating about 100 list values per entity.
> >
> > Given that this is a regular thing you need to do, perhaps there is an
> > engineering solution for what you are trying to do that will be more cost
> > effective. Can you describe why you're running this job? What features does
> > this support in your product?
> >
> > --
> > Ikai Lan
> > Developer Programs Engineer, Google App Engine
> > plus.ikailan.com | twitter.com/ikai
> >
> > On Thu, Jan 5, 2012 at 10:08 AM, Petey <[email protected]> wrote:
> > > In this one case we had to change all of the items in the
> > > listproperty. In our most common case we might have to add and delete
> > > a couple items to the list property every once in a while. That would
> > > still cost us well over $1,000 each time.
> > >
> > > Most of the reason for this type of data in our product is to
> > > compensate for the fact that there isn't full text search yet. I know
> > > they are beta testing full text, but I'm still worried that that also
> > > might be too expensive per write.
> > >
> > > On Jan 5, 6:54 am, Richard Watson <[email protected]> wrote:
> > > > A couple thoughts.
> > > >
> > > > Maybe the GAE team should borrow the idea of spot prices from Amazon.
> > > > That's a great way to have lower-priority jobs that can run when there
> > > > are instances available. We set the price we're willing to pay, if the
> > > > spot cost drops below that, we get the resources. It creates a market
> > > > where more urgent jobs get done sooner and Google makes better use of
> > > > quiet periods.
> > > >
> > > > On your issue:
> > > > Do you need to update every entity when you do this? How many items on
> > > > the listproperty need to be changed? Could you tell us a bit more of
> > > > what the data looks like?
> > > >
> > > > I'm thinking that 14 million entities x 18 items each is the amount of
> > > > entries you really have, each distributed across at least 3 servers and
> > > > then indexed. That seems like a lot of writes if you're re-writing
> > > > everything. It's likely a bad idea to rely on an infrastructure change
> > > > to fix this (recurring) issue, but there is hopefully a way to reduce
> > > > the amount of writes you have to do.
> > > >
> > > > Also, could you maybe run your mapreduce on smaller sets of the data to
> > > > spread it out over multiple days and avoid adding too many instances?
> > > > Has anyone done anything like this?
> > >
> > > --
> > > You received this message because you are subscribed to the Google Groups
> > > "Google App Engine" group.
> > > To post to this group, send email to [email protected].
> > > To unsubscribe from this group, send email to
> > > [email protected].
> > > For more options, visit this group at
> > > http://groups.google.com/group/google-appengine?hl=en.
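
For anyone following the numbers in this thread, the write-count arithmetic can be reproduced with a few lines of Python. This is only a sketch of the model Corey and Ikai describe (1 write for the entity, 4 index writes per changed list value, plus 2 per composite index), using the figures quoted above. The function and constant names are made up for illustration and are not part of any App Engine API.

```python
# Sketch of the write-op arithmetic discussed in the thread.
# All constants are figures quoted by Corey and Ikai; every name here is
# hypothetical and exists only for this illustration.

PRICE_PER_WRITE_OP = 0.10 / 100_000   # $0.10 per 100k operations (Ikai)
TOTAL_ENTITIES = 14_000_000           # ~14 million Place entities (Corey)


def writes_per_string_change(composite_indexes):
    # Corey's model: 4 writes for the built-in ascending/descending index
    # rows, plus 2 writes per composite index that includes the property.
    return 4 + 2 * composite_indexes


def writes_per_entity(strings_removed, strings_added, composite_indexes):
    # 1 write for the entity itself, plus the index writes for every list
    # value that is removed or added.
    changes = strings_removed + strings_added
    return 1 + writes_per_string_change(composite_indexes) * changes


def cost_dollars(write_ops):
    # Dollar cost of a number of write operations at the quoted price.
    return write_ops * PRICE_PER_WRITE_OP


predicted = writes_per_entity(20, 20, composite_indexes=3)
observed = 2_500_000_000 / 5_424_000  # billed write ops / entities updated

print("predicted writes per entity:", predicted)
print("observed writes per entity: %.0f" % observed)
print("full run at predicted rate: $%.0f" % cost_dollars(predicted * TOTAL_ENTITIES))
print("full run at observed rate:  $%.0f" % cost_dollars(observed * TOTAL_ENTITIES))
```

Under this model the prediction is 401 writes per entity, or roughly $5,600 for all 14 million entities. The billed rate of about 461 writes per entity puts a full run at about $6,450, which is in line with Petey's $6,500 estimate.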

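
The geoboxing idea in Corey's "Geoboxing Details" can also be sketched in Python. This is not Firespotter's code and not the code from the linked article. It assumes the string layout is north|west|south|east (inferred from the example strings in the thread), that boxes sit on a grid aligned to multiples of the box size, and that "nearby boxes" means the 3x3 block around the point. The function name and parameters are hypothetical.

```python
from decimal import Decimal, ROUND_FLOOR


def geoboxes_for(lat, lon, size, places):
    """Return the 3x3 block of geobox strings around a point.

    size   -- box edge length in degrees, given as a string such as "0.002"
    places -- number of decimal places to print in each coordinate
    Layout "north|west|south|east" is an assumption based on the thread.
    """
    size = Decimal(size)
    # Snap the point down to the south-west corner of its grid cell.
    south = (Decimal(str(lat)) / size).to_integral_value(rounding=ROUND_FLOOR) * size
    west = (Decimal(str(lon)) / size).to_integral_value(rounding=ROUND_FLOOR) * size

    boxes = []
    for dlat in (-1, 0, 1):
        for dlon in (-1, 0, 1):
            s = south + dlat * size
            w = west + dlon * size
            corners = (s + size, w, s, w + size)  # north, west, south, east
            boxes.append("|".join("{:.{p}f}".format(c, p=places) for c in corners))
    return boxes


# One resolution gives 9 strings; repeating with larger box sizes builds up
# the kind of ~20-string list described in the thread.
boxes = geoboxes_for(37.34, -121.893, "0.002", 3)
print(len(boxes), boxes[4])
```

A proximity query then becomes an equality filter on one of these strings, which is why every string has to be indexed and why changing the box size rewrites every index entry.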