Hi,
I'm using a very simple model to store arrays of document ids for an
inverted index based on 3 million documents.
class I(db.Model):
    v = ArrayProperty(typecode="I", required=True)
which uses:
http://appengine-cookbook.appspot.com/recipe/store-arrays-of-numeric-values-efficiently-in-the-datastore/
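For anyone who hasn't seen the recipe: the core idea is just packing the unsigned ints into a compact byte string (typically 4 bytes each for typecode "I") and storing that as a Blob, instead of a repeated int property. A minimal sketch of the packing round-trip, stripped of the App Engine property plumbing (and written with Python 3's tobytes/frombytes; the recipe itself uses the older tostring/fromstring names):

```python
from array import array

def pack_ids(ids):
    # serialize unsigned ints into a compact byte string (typically 4 bytes each)
    return array('I', ids).tobytes()

def unpack_ids(blob):
    # rebuild the array of unsigned ints from the stored bytes
    a = array('I')
    a.frombytes(blob)
    return list(a)
```

The recipe wraps this pair in a custom db.Property so the datastore only ever sees one Blob per entity.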
I have a simple task queue running the following piece of logic. It
executes about 3,000 times a day (once per new incoming document, and
each document generates on average 3,500 keys) to update the index:
keys = gen_keys(document)  # builds a list of db.Key instances from the document
indexes = db.get(keys)
upserts = []
for i, index in enumerate(indexes):
    if index is None:
        # no posting list exists yet for this key: create one
        upserts.append(I(key=keys[i], v=array('I', [document_id])))
    elif document_id not in index.v:
        # append this document to the existing posting list
        index.v.append(document_id)
        upserts.append(index)
db.put(upserts)
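One thing worth noting: with ~3,500 keys per document, that final db.put can be a very large batch. If db.put balks at big lists (500 entities is the per-call limit I've seen cited, though I haven't verified it against the current SDK), splitting the upserts is straightforward. A sketch of the chunking helper I have in mind:

```python
def chunked(seq, size=500):
    # yield successive slices of at most `size` items
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# usage would then be:
#   for batch in chunked(upserts):
#       db.put(batch)
```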
This loop leads to datastore CPU usage of 48 hours per 1,000 documents,
which means a daily spend of $16.80 just for the datastore updates.
That seems quite expensive given that something like Kyoto Cabinet
running on conventional hosting could easily handle this load. Does
anyone have any ideas for minimizing the datastore CPU usage? My hunch
is that datastore CPU is a bit overpriced :(
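To put the figures above together (just arithmetic on the numbers in this post, nothing new):

```python
docs_per_day = 3000
keys_per_doc = 3500
entity_touches = docs_per_day * keys_per_doc   # 10,500,000 index entities fetched/written per day
cpu_hours = 48 * (docs_per_day / 1000.0)       # 144 datastore CPU-hours per day
implied_rate = 16.80 / cpu_hours               # ~$0.12 per datastore CPU-hour
```

So every document triggers roughly 3,500 get+put pairs, which is where those CPU-hours go.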
Cheers,
Donovan.
--
You received this message because you are subscribed to the Google Groups
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/google-appengine?hl=en.