Hi,

The app we built is facing some scalability issues, and I'm thinking there 
must be a better way, so I'm asking for help.

Basically we fetch several RSS feeds and classify each article using a naive 
Bayes algorithm. The problem is that if a given article has too many 
words/features, the task takes too long and gets terminated.

This is how the classification process happens:

   1. Fetch the last unprocessed document from the datastore
   2. For *each word* in the document, check the datastore and/or 
   memcache to get its classification (the Bayes algorithm was implemented 
   using the code in Programming Collective Intelligence)
   3. Process all the word classifications to assign an overall tag to the 
   document (e.g. good, bad, neutral)
   4. Save the document classification in the datastore and memcache
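In case it helps to see what I mean, here's a rough sketch of steps 2 and 3 in plain Python. All the names and counts are made up for illustration (our real code looks up each word in the datastore/memcache, as described above), but the scoring shape is the same: one lookup per word, then combine per-tag scores.

```python
import math
from collections import defaultdict

# Hypothetical per-word statistics: word -> {tag: count}. In the real app
# these come from the datastore/memcache, one lookup per word (step 2).
WORD_COUNTS = {
    "great": {"good": 8, "bad": 1, "neutral": 1},
    "crash": {"good": 1, "bad": 7, "neutral": 2},
    "today": {"good": 3, "bad": 3, "neutral": 4},
}
TAGS = ("good", "bad", "neutral")

def classify(words):
    """Steps 2-3: look up each word's per-tag counts and combine them
    into an overall tag by summing log-probabilities (naive Bayes style)."""
    scores = defaultdict(float)
    for word in words:
        counts = WORD_COUNTS.get(word)   # step 2: one lookup per word
        if counts is None:
            continue                     # unseen words contribute nothing
        total = sum(counts.values())
        for tag in TAGS:
            # add-one smoothing so a zero count doesn't sink the whole tag
            p = (counts.get(tag, 0) + 1) / float(total + len(TAGS))
            scores[tag] += math.log(p)
    # step 3: the tag with the highest combined score wins
    return max(scores, key=scores.get) if scores else "neutral"

print(classify(["great", "today"]))  # -> "good"
```

The per-word loop is exactly where the time goes for long articles, since every word is a separate round trip.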

The problem lies in step 2. Initially I thought the solution might be to 
batch the get queries, but that's not possible since I don't have the 
corresponding keys, so now I'm lost as to how to improve performance.
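One idea I've been toying with, in case someone can confirm it's sane: if each word entity used the word string itself as its key_name, the keys could be derived straight from the document and fetched in batches (on App Engine that would be db.Key.from_path plus a single batched db.get, if I understand the API right). Here's a plain-Python sketch of the batching shape, with a dict standing in for the datastore since I'm not showing our real model:

```python
# Sketch, assuming each word entity stores the word as its key_name, so keys
# can be derived from the document instead of queried. A plain dict stands
# in for the datastore; the batch loop simulates one round trip per chunk.
FAKE_DATASTORE = {
    "great": "good",
    "crash": "bad",
}

def batch_get_classifications(words, batch_size=500):
    """Fetch classifications for the document's distinct words in batches,
    instead of one datastore round trip per word."""
    distinct = list(set(words))
    found = {}
    for i in range(0, len(distinct), batch_size):
        batch = distinct[i:i + batch_size]
        # one simulated batched get per chunk of up to batch_size keys
        for w in batch:
            if w in FAKE_DATASTORE:
                found[w] = FAKE_DATASTORE[w]
    return found

print(batch_get_classifications(["great", "great", "crash", "unknown"]))
```

That would turn thousands of per-word gets into a handful of batched ones, but it only works if the key_name assumption holds for our data.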

This article 
http://answers.oreilly.com/topic/2427-how-to-speed-up-machine-learning-using-a-set-oriented-approach/
 
outlines how they solved the problem using a set-oriented approach 
(Postgres). From what I understand, they fetched all the words in the 
database and then just checked whether the words in the document were 
there, but I'm not sure that's really how they did it. I can't fetch all 
the classified words from the datastore/memcache since there's a limit of 
1000 records per query, and it seems like a less-than-optimal solution 
anyway.
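The closest I can get to their set-oriented idea, inverted to fit the 1000-record limit, would be to query only for the document's own distinct words rather than pulling the whole vocabulary. A toy sketch (the VOCAB dict is a stand-in for whatever is actually stored per word):

```python
# Stand-in for the stored per-word tags; in reality this lookup would hit
# the datastore/memcache, not an in-memory dict.
VOCAB = {
    "great": "good", "awful": "bad", "crash": "bad", "ok": "neutral",
}

def set_oriented_lookup(document_words):
    """Intersect the document's distinct words with the known vocabulary,
    so the scoring step only touches words that actually exist."""
    wanted = set(document_words)        # the only words we care about
    present = wanted & set(VOCAB)       # one set intersection, no per-word checks
    return {w: VOCAB[w] for w in present}

print(set_oriented_lookup(["great", "crash", "xyzzy"]))
```

A document rarely has anywhere near 1000 *distinct* words, so this stays under the limit, but I don't know if it's what the article actually meant.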

Can anyone offer some advice?

Thanks,

Sofia

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
