I was going to suggest YQL term extraction, which is quite good. But be sure to catch up on today's news regarding Y! & MS, which makes any use of Yahoo!'s APIs a very risky bet.
Udi

On 7/29/09, Refael <[email protected]> wrote:
>
> I think I have a jackpot
>
> Using Yahoo Term extractor on a random 20 articles I get:
> [(u'django', 14),
> (u'google', 4),
> (u'python', 3),
> (u'oxford', 2),
> (u'ruby on rails', 2),
> (u'migrations', 2),
> (u'apps', 2),
> (u'iteration', 2),
> (u'snippets', 2),
> (u'models', 2),
> (u'running', 2),
> (u'google maps', 2),
> (u'unit tests', 1),
> (u'geek night', 1),
> (u'ali', 1),
> (u'dba', 1),
> (u'celebrities', 1),
> (u'data models', 1),
> (u'vagas para', 1),
> (u'admin interface', 1),
> (u'nbsp', 1),
> (u'password xxx', 1),
> (u'internet explorer', 1),
> (u'volta', 1),
> (u's\xe3o paulo', 1),
> (u'long time', 1),
> (u'larson', 1),
> (u'staging', 1),
> (u'capabilities', 1),
> (u'blog', 1),
> (u'pra valer', 1),
> (u'dict', 1),
> (u'search software', 1),
> (u'advice', 1),
> (u'interactive map', 1),
> (u'crash', 1),
> (u'banco de dados', 1),
> (u'keyword arguments', 1),
> (u'export library', 1),
> (u'core management', 1),
> (u'fantasy sport', 1),
> (u'submission', 1),
> (u'foi', 1),
> (u'html javascript', 1),
> (u'last time', 1),
> (u'cms', 1),
> (u'database name', 1),
> (u'enthusiasts', 1),
> (u'map', 1),
> (u'cairo', 1),
> (u'creation', 1),
> (u'sync', 1),
> (u'meta', 1),
> (u'sem', 1),
> (u'inkscape', 1),
> (u'pylons', 1),
> (u'pdf export', 1),
> (u'abc', 1),
> (u'install software', 1),
> (u'exit 1', 1),
> (u'uma', 1),
> (u'irc', 1),
> (u'dias', 1),
> (u'exercise', 1),
> (u'best project', 1),
> (u'time one', 1),
> (u'reason', 1),
> (u'interface', 1),
> (u'webapp', 1),
> (u'bottom line', 1),
> (u'database engine', 1),
> (u'friends houses', 1),
> (u'looking at the environment', 1),
> (u'launch', 1),
> (u'content types', 1),
> (u'ajax', 1),
> (u'discussion groups', 1),
> (u'new game', 1),
> (u'new features', 1),
> (u'aptitude', 1),
> (u'para quem', 1),
> (u'fun parties', 1),
> (u'few days', 1),
> (u'jay graves', 1),
> (u'interact', 1),
> (u'private league', 1),
> (u'lot', 1),
> (u'hollywood', 1),
> (u'checkout', 1),
> (u'public presentation', 1),
> (u'game model', 1),
> (u'fun things', 1),
> (u'south project', 1),
> (u'slides', 1),
> (u'freelancer', 1),
> (u'object oriented', 1),
> (u'sphinx', 1),
> (u'insights', 1),
> (u'scratchpad', 1),
> (u'initial release', 1),
> (u'rio', 1),
> (u'super models', 1),
> (u'presentation program', 1),
> (u'browser', 1),
> (u'debugging', 1),
> (u'positive reaction', 1),
> (u'initial development', 1),
> (u'business logic', 1),
> (u'representative locator', 1),
> (u'traditional fantasy', 1),
> (u'implementations', 1),
> (u'raw', 1),
> (u'absolute url', 1),
> (u'o tempo', 1),
> (u'technology', 1),
> (u'greenpeace', 1),
> (u'html css', 1),
> (u'pdftk', 1),
> (u'line test', 1),
> (u'nas', 1),
> (u'functionality', 1),
> (u'import user', 1),
> (u'sqlite3', 1),
> (u'webdesigner', 1),
> (u'server os', 1),
> (u'record time', 1),
> (u'quote', 1),
> (u'first installment', 1),
> (u'test automation conference', 1),
> (u'sys', 1),
> (u'fantasy game', 1),
> (u'pool', 1),
> (u'first name last name', 1),
> (u'design patterns', 1),
> (u'modes', 1),
> (u'driven development', 1),
> (u'os system', 1),
> (u'databases', 1),
> (u'output variables', 1),
> (u'cookbook', 1)
> ]
>
> On Jul 13, 7:49 pm, Imri Goldberg <[email protected]> wrote:
>> My shneckel:
>> 1. Have a simple cull list (take the 5 minutes to write it, and it will do
>> 80% of the work)
>> 2. Use TF/IDF
>>
>> On Mon, Jul 13, 2009 at 7:02 PM, Refael <[email protected]> wrote:
>>
>> > I've run the data through Whoosh, and now the hardest part is to cull
>> > the words.
>> > For example these are the top 10 word counts:
>> > (u'django', 15051),
>> > (u'have', 4066),
>> > (u'your', 3770),
>> > (u'us', 3311),
>> > (u'python', 2738),
>> > (u'some', 2713),
>> > (u'site', 2501),
>> > (u'code', 2359),
>> > (u'like', 2335),
>> > (u'project', 2327),
>>
>> > Any ideas how to sort out relevant tags?
>>
>> > On Jun 25, 4:36 pm, benny daon <[email protected]> wrote:
>> > > Hi all,
>> > > I've got a project going with the aim of improving djangoproject.com.
>> > > So far I've forked the original code, cleaned it up, added buildout so
>> > > installation will be a breeze, and added django-south so we can easily
>> > > upgrade the database.
>> > > Jacob KM sent me a link to a dump of the current database, which I included
>> > > in the migration script so the code pulls the dump and uses it to create the
>> > > database and add all the rows. There are almost 5000 rows in the model,
>> > > pointing to django related posts. The next step is to extract common tags
>> > > from the title and summary fields of the FeedItem.
>> > > A friend recommended I use Solr or Lucene for this job, which makes sense. My
>> > > issue is that I have never used them before. If you know what needs to be done
>> > > and have some time, please assign this ticket -
>> > > http://bitbucket.org/daonb/django-website/issue/3/-to yourself, fork the
>> > > code, do it, and send me a 'pull request'.
>>
>> > > Thanks,
>>
>> > > Benny.
>>
>> > > BTW - there's much more to do in this project. Please feel free to open
>> > > tickets with suggestions/bugs or better yet - send code. Jacob said he will
>> > > use it in the live site.
>>
>> --
>> Imri Goldberg
>> --------------------------------------
>> www.algorithm.co.il/blogs/
>> --------------------------------------
>> -- insert signature here ----
>

You received this message because you are subscribed to the Google Groups "PyWeb-IL" group.
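
For reference, here is a minimal sketch of the two suggestions quoted in the thread (a simple cull list, then TF/IDF), in plain Python with no Whoosh or Yahoo dependency. The cull list, the function names `tokenize` and `tfidf_tags`, and the sample titles are all made up for illustration; none of them come from the django-website code.

```python
import math
import re
from collections import Counter

# Hypothetical cull list: a few common English words to drop up front.
CULL = {"a", "an", "and", "the", "to", "of", "in", "for", "with", "on",
        "is", "have", "your", "us", "some", "like"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not culled."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in CULL]


def tfidf_tags(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    tags = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times log inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:top_n])
    return tags


if __name__ == "__main__":
    # Made-up titles standing in for FeedItem title/summary text.
    titles = [
        "Django migrations with South",
        "Django admin interface tips",
        "Django unit tests in Python",
    ]
    for title, tags in zip(titles, tfidf_tags(titles, top_n=2)):
        print(title, "->", tags)
```

A term that shows up in every document (here 'django') gets an inverse document frequency of log(1) = 0, so it drops to the bottom of the ranking without having to be listed in the cull list. That is the main reason to combine the two steps rather than rely on raw counts.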
