Hi,

The app we built is facing some scalability issues, and I'm thinking there must be a better way, so I'm asking for help.
Basically, we fetch several RSS feeds and classify each article using a naive Bayes algorithm. The problem is that if a given article has too many words/features, the task takes too long and gets terminated. This is how the classification process works:

1. Fetch the last unprocessed document from the datastore.
2. For *each word* in the document, check the datastore and/or memcache to get its classification. (The Bayes algorithm was implemented using the code in Programming Collective Intelligence.)
3. Process all the word classifications to assign an overall tag to the document (e.g. good, bad, neutral).
4. Save the document classification in the datastore and memcache.

The problem lies in step 2. Initially I thought the solution might be batched get queries, but that isn't possible since I don't have the corresponding keys, so now I'm lost as to how to improve performance.

This article http://answers.oreilly.com/topic/2427-how-to-speed-up-machine-learning-using-a-set-oriented-approach/ outlines how they solved the problem using a set-oriented approach (Postgres). From what I understand, they fetched all the words in the database and then just checked whether the words in the document were among them, but I'm not sure that's really how they did it. I can't fetch all the classified words from the datastore/memcache in one query since there's a limit of 1000 records, and it seems a less than optimal solution anyway.

Can anyone offer some advice?

Thanks,
Sofia

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
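To make the step-2 bottleneck concrete, the per-word lookup can be sketched roughly like this. The dictionaries here are illustrative stand-ins for the actual datastore/memcache records (the real schema isn't shown in the post), and the scoring is a generic Laplace-smoothed naive Bayes in the spirit of Programming Collective Intelligence, not the exact implementation:

```python
import math

# Hypothetical stand-in for the stored word records: each word maps to
# counts of how often it appeared in documents of each category.
# Names and structure are illustrative only.
word_counts = {
    "great":  {"good": 8, "bad": 1, "neutral": 1},
    "crash":  {"good": 1, "bad": 7, "neutral": 2},
    "update": {"good": 3, "bad": 3, "neutral": 4},
}
categories = ["good", "bad", "neutral"]

def classify(document_words):
    """Steps 2-3: look up each word individually, then combine the
    per-word evidence into one overall tag via naive Bayes log-sums."""
    scores = dict((c, 0.0) for c in categories)
    for word in document_words:
        counts = word_counts.get(word)  # one lookup per word -- the bottleneck
        if counts is None:
            continue                    # unseen word contributes no evidence
        total = sum(counts.values())
        for c in categories:
            # Laplace-smoothed P(word | category)
            p = (counts.get(c, 0) + 1.0) / (total + len(categories))
            scores[c] += math.log(p)
    return max(scores, key=scores.get)

print(classify(["great", "update"]))  # -> good
print(classify(["crash", "crash"]))   # -> bad
```

With an in-memory dict each lookup is cheap, but when `word_counts.get(word)` is instead a datastore or memcache round trip, the cost grows linearly with the number of words in the article, which is what makes long articles blow the task deadline.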
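The set-oriented idea from the O'Reilly article, as I understand it, can be sketched with a plain dict in place of the datastore: page the classified vocabulary into memory once (in batches, to respect the 1000-record fetch limit), then classify every document against that in-memory table with no per-word round trips. `fetch_word_batch` here is a hypothetical stand-in for a paged datastore query:

```python
def fetch_word_batch(all_rows, offset, limit=1000):
    """Stand-in for a paged query; returns up to `limit` (word, counts) rows."""
    return all_rows[offset:offset + limit]

def load_vocabulary(all_rows):
    """Page through the word table 1000 rows at a time, building one dict."""
    vocab, offset = {}, 0
    while True:
        batch = fetch_word_batch(all_rows, offset)
        if not batch:
            break
        vocab.update(batch)
        offset += len(batch)
    return vocab

# Illustrative data: 2500 dummy rows, to show the paging loop working.
rows = [("w%d" % i, {"good": 1}) for i in range(2500)]
vocab = load_vocabulary(rows)
print(len(vocab))  # -> 2500

# Classifying a document is then a pure in-memory intersection:
doc = ["w3", "w42", "unseen"]
known = [w for w in doc if w in vocab]
print(known)  # -> ['w3', 'w42']
```

On App Engine specifically, another option might avoid loading the whole vocabulary: if each word entity used the word string itself as its key_name, the keys for a document's words could be rebuilt from the text and fetched with a single batched get, sidestepping the "I don't have the keys" problem. Whether that fits depends on how the word entities are currently keyed.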
