Hi Eli,

Thanks for the reply! Replies inline:

> Your description of what you're doing and why isn't very complete.

I'm creating an index of at least 3 million documents which will grow
on a daily basis by a further 3000 documents. The index will be used
for searching the documents in interesting ways :)

> Let me re-state your points to see if I understand you correctly:
> 1.  About 3,000 new documents are processed each day.
> 2.  When a new document comes in, the task queue code you posted runs
> against that new document.  On average, a new document is associated with
> 3,500 index keys.
> What does gen_keys() do exactly?  How does it generate its list of db.Keys?

It's a trade secret :) Seriously though, it just returns a sorted set
of db.Key instances, where the numeric id of each key is produced by a
hashing algorithm that takes the document as input. The sorted set
typically contains 3,500 keys per call to gen_keys() per document. The
numeric id is bounded by the limits of a 32-bit unsigned int,
i.e. 0 <= id <= 4,294,967,295.
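
For illustration only, here's a toy sketch of the *shape* of such a
function (the shingling scheme, the name gen_ids, and all parameters are
assumptions, not the actual algorithm, which stays secret):

```python
import hashlib

def gen_ids(document, n_keys=3500):
    """Toy sketch: derive a sorted set of 32-bit numeric ids from a document.

    The real gen_keys() is proprietary; this only illustrates the output
    shape -- each id fits in an unsigned 32-bit int, and in practice each
    one would become a datastore key via db.Key.from_path('I', id).
    """
    ids = set()
    words = document.split()
    # Hash overlapping 3-word shingles (an assumed feature scheme) into ids.
    for i in range(len(words)):
        shingle = " ".join(words[i:i + 3]).encode("utf-8")
        digest = hashlib.sha1(shingle).digest()
        # Take the first 4 bytes of the digest as an unsigned 32-bit integer.
        ids.add(int.from_bytes(digest[:4], "big"))
        if len(ids) >= n_keys:
            break
    return sorted(ids)
```

The important properties are just that the output is deterministic per
document, sorted, and bounded by the 32-bit id space.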

> How big is the average indexes entity?

It depends on how many documents have been processed and how many
collisions have occurred for that id, but it's a multiple of 4 bytes
for the v property. Statistically, 3 million documents would yield
10,500,000,000 keys, giving an average of 2.44 documents per index
entity. In reality the chance of collision is much higher, so there
will be ids missing from the range and the average will be greater
than 2.44.
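
The arithmetic checks out directly, taking 2^32 as the size of the id
space:

```python
total_keys = 3_000_000 * 3_500   # 3M documents * 3,500 keys each = 10.5 billion
id_space = 2 ** 32               # distinct 32-bit unsigned ids available
average = total_keys / id_space  # documents per index entity, ~2.44
```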

> More than likely, I'm guessing that pulling all the associated document keys
> for the various index entities out of the datastore just to append a single
> key to the end of each array.. is wasting some resources.

Probably, but with the shuffle-and-sort capability of the MapReduce
API not yet finished, it's difficult to group documents together for
batched inserts against a single index key.
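
As a sketch of the grouping I'd want (a hypothetical batching step done
in application code, not what the current task-queue code does): collect
every (index id, document id) pair emitted by a batch of documents,
then group by index id, so each index entity is fetched and written
once per batch instead of once per document.

```python
from collections import defaultdict

def group_upserts(pairs):
    """Group (index_id, document_id) pairs by index id.

    This is roughly what a MapReduce shuffle-and-sort phase would
    provide for free: all new document ids destined for the same index
    entity end up together, ready for a single read-modify-write.
    """
    grouped = defaultdict(list)
    for index_id, document_id in pairs:
        grouped[index_id].append(document_id)
    return grouped
```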

> But, it's hard to tell without seeing more actual code.. and maybe getting a
> clearer understanding of why you're doing what you're doing.
> Thanks for additional info.
>

Hope that helps! I thought relation index entities might help, but
they're currently too slow for big queries:

https://groups.google.com/d/topic/google-appengine-python/t4vBnMH5J4M/discussion

Cheers,
Donovan.

> On Thu, Jan 6, 2011 at 3:22 PM, Donovan Hide <[email protected]> wrote:
>>
>> oops, task queue code should be:
>>
>> keys = gen_keys(document)  # builds a list of db.Key instances
>> # based on the document
>> indexes = db.get(keys)
>> upserts = []
>> for i, key in enumerate(keys):
>>     if indexes[i] is None:
>>         upserts.append(I(key=keys[i], v=array('I', [document_id])))
>>     elif document_id not in indexes[i].v:
>>         indexes[i].v.append(document_id)
>>         upserts.append(indexes[i])
>> db.put(upserts)
>>
>> On 6 January 2011 19:36, Donovan <[email protected]> wrote:
>> > Hi,
>> >
>> > I'm using a very simple model to store arrays of document ids for an
>> > inverted index based on 3 million documents.
>> >
>> > class I(db.Model):
>> >    v=ArrayProperty(typecode="I",required=True)
>> >
>> > which uses:
>> >
>> >
>> > http://appengine-cookbook.appspot.com/recipe/store-arrays-of-numeric-values-efficiently-in-the-datastore/
>> >
>> > I have a simple task queue that includes the following piece of logic
>> > which loops 3,000 times a day, for new incoming documents which
>> > generate on average 3,500 keys each, to update the index:
>> >
>> > keys = gen_keys(document) // Builds a list of db.Key instances based
>> > on the document
>> > indexes=db.get(keys)
>> > upserts=[]
>> > for i,key in enumerate(indexes):
>> >    if indexes[i] is None:
>> >        upserts.append(I(key=keys[i],v=array('I',[document_id])))
>> >    elif news_article_id not in indexes[i].v:
>> >         indexes[i].v.append(document_id)
>> >         upserts.append(indexes[i])
>> > db.put(upserts)
>> >
>> > This loop leads to datastore CPU usage of 48 hours per 1000 documents
>> > which means a daily spend of $16.80 just for the datastore updates,
>> > which seems quite expensive given how something like Kyoto Cabinet
>> > running on conventional hosting could easily deal with this load. Does
>> > anyone have any ideas for minimizing the datastore CPU usage? My hunch
>> > is that the datastore CPU usage is a bit overpriced :(
>> >
>> > Cheers,
>> > Donovan.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/google-appengine?hl=en.
>>
>

