Hi,

The app we built is facing some scalability issues, and I'm thinking there 
must be a better way, so I'm asking for help.

Basically we fetch several RSS feeds and classify each article using a naive 
Bayes algorithm. The problem is that if a given article has too many 
words/features, the task takes too long and gets terminated.

This is how the classification process happens:

   1. Fetch the last unprocessed document from the datastore
   2. For *each word* in the document, check the datastore and/or 
   memcache to get its classification (the Bayes algorithm was implemented 
   using the code in Programming Collective Intelligence)
   3. Process all the word classifications to assign an overall tag to the 
   document (e.g. good, bad, neutral)
   4. Save the document classification in the datastore and memcache
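In case it helps to see what I mean, here's a rough sketch of steps 2 and 3 in plain Python. All the names and counts are made up for illustration (our real code looks up each word in the datastore/memcache, as described above), but the scoring shape is the same: one lookup per word, then combine per-tag scores.

```python
import math
from collections import defaultdict

# Hypothetical per-word statistics: word -> {tag: count}. In the real app
# these come from the datastore/memcache, one lookup per word (step 2).
WORD_COUNTS = {
    "great": {"good": 8, "bad": 1, "neutral": 1},
    "crash": {"good": 1, "bad": 7, "neutral": 2},
    "today": {"good": 3, "bad": 3, "neutral": 4},
}
TAGS = ("good", "bad", "neutral")

def classify(words):
    """Steps 2-3: look up each word's per-tag counts and combine them
    into an overall tag by summing log-probabilities (naive Bayes style)."""
    scores = defaultdict(float)
    for word in words:
        counts = WORD_COUNTS.get(word)   # step 2: one lookup per word
        if counts is None:
            continue                     # unseen words contribute nothing
        total = sum(counts.values())
        for tag in TAGS:
            # add-one smoothing so a zero count doesn't sink the whole tag
            p = (counts.get(tag, 0) + 1) / float(total + len(TAGS))
            scores[tag] += math.log(p)
    # step 3: the tag with the highest combined score wins
    return max(scores, key=scores.get) if scores else "neutral"

print(classify(["great", "today"]))  # -> "good"
```

The per-word loop is exactly where the time goes for long articles, since every word is a separate round trip.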

The problem lies in step 2. Initially I thought the solution might be to 
batch the get queries, but that's not possible since I don't have the 
corresponding keys, so now I'm lost as to how to improve performance.
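One idea I've been toying with, in case someone can confirm it's sane: if each word entity used the word string itself as its key_name, the keys could be derived straight from the document and fetched in batches (on App Engine that would be db.Key.from_path plus a single batched db.get, if I understand the API right). Here's a plain-Python sketch of the batching shape, with a dict standing in for the datastore since I'm not showing our real model:

```python
# Sketch, assuming each word entity stores the word as its key_name, so keys
# can be derived from the document instead of queried. A plain dict stands
# in for the datastore; the batch loop simulates one round trip per chunk.
FAKE_DATASTORE = {
    "great": "good",
    "crash": "bad",
}

def batch_get_classifications(words, batch_size=500):
    """Fetch classifications for the document's distinct words in batches,
    instead of one datastore round trip per word."""
    distinct = list(set(words))
    found = {}
    for i in range(0, len(distinct), batch_size):
        batch = distinct[i:i + batch_size]
        # one simulated batched get per chunk of up to batch_size keys
        for w in batch:
            if w in FAKE_DATASTORE:
                found[w] = FAKE_DATASTORE[w]
    return found

print(batch_get_classifications(["great", "great", "crash", "unknown"]))
```

That would turn thousands of per-word gets into a handful of batched ones, but it only works if the key_name assumption holds for our data.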

This article 
http://answers.oreilly.com/topic/2427-how-to-speed-up-machine-learning-using-a-set-oriented-approach/
 
outlines how they solved the problem using a set-oriented approach 
(Postgres). From what I understand, they fetched all the words in the 
database and then just checked whether the words in the document were 
there, but I'm not sure that's really how they did it. I can't fetch all 
the classified words from the datastore/memcache since there's a limit of 
1000 records per query, and it seems like a less-than-optimal solution 
anyway.
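The closest I can get to their set-oriented idea, inverted to fit the 1000-record limit, would be to query only for the document's own distinct words rather than pulling the whole vocabulary. A toy sketch (the VOCAB dict is a stand-in for whatever is actually stored per word):

```python
# Stand-in for the stored per-word tags; in reality this lookup would hit
# the datastore/memcache, not an in-memory dict.
VOCAB = {
    "great": "good", "awful": "bad", "crash": "bad", "ok": "neutral",
}

def set_oriented_lookup(document_words):
    """Intersect the document's distinct words with the known vocabulary,
    so the scoring step only touches words that actually exist."""
    wanted = set(document_words)        # the only words we care about
    present = wanted & set(VOCAB)       # one set intersection, no per-word checks
    return {w: VOCAB[w] for w in present}

print(set_oriented_lookup(["great", "crash", "xyzzy"]))
```

A document rarely has anywhere near 1000 *distinct* words, so this stays under the limit, but I don't know if it's what the article actually meant.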

Can anyone offer some advice?

Thanks,

Sofia

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
