Had a look at it some time ago, but admittedly only skimmed it. Just read it again: it looks good, allows dimension reduction with ease, and hence looks scalable.
tks

Paul

________________________________
From: Grant Ingersoll <gsing...@apache.org>
To: mahout-user@lucene.apache.org
Sent: Wednesday, 24 June, 2009 12:34:46
Subject: Re: mahout PLSI (with some lucene, thrown in)

Random FYI: http://code.google.com/p/semanticvectors/ came up on the Lucene mailing list yesterday and it sounds interesting, plus BSD license...

-Grant

On Jun 23, 2009, at 7:56 PM, Paul Jones wrote:

> Yup, I see that wordnet has also been "ported" to a lucene index, and hence
> pulling the hyponyms works great.
>
> tks
>
> Paul
>
> ________________________________
> From: Tommy Chheng <to...@peoplejar.com>
> To: mahout-user@lucene.apache.org
> Sent: Tuesday, 23 June, 2009 23:19:25
> Subject: Re: mahout PLSI (with some lucene, thrown in)
>
> Have you looked at WordNet to get the hyponyms?
>
> Tommy
>
> On Jun 23, 2009, at 3:09 PM, Paul Jones wrote:
>
>> Okay, have seen the difficulty (apart from the maths :-)).
>>
>> I guess "similar" can mean many things, i.e. hyponyms, but words such as
>> hot...cold are also "related", hence to solve my little problem I am
>> wondering if there is an easier way, i.e. to use things like existing hyponym
>> relations (wordnet and the like), and/or if they do not exist then I
>> guess using something similar to a "google distance measure" may help in
>> "adding" new words to the system....
>>
>> Paul
>>
>> ________________________________
>> From: Ted Dunning <ted.dunn...@gmail.com>
>> To: mahout-user@lucene.apache.org
>> Sent: Tuesday, 23 June, 2009 18:00:12
>> Subject: Re: mahout PLSI (with some lucene, thrown in)
>>
>> Yes. This can be done. It isn't necessarily real simple to do.
>>
>> See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275 for an
>> old (but still pretty good) example.
>>
>> On Tue, Jun 23, 2009 at 6:45 AM, Paul Jones <paul_jone...@yahoo.co.uk> wrote:
>>
>>> Imagine we have crawled 100K webpages, and we have 100 pages which show
>>> "red" and 100 which show "crimson" and then 100 which show both "red and
>>> crimson". Is there a way to deduce that there may be an (albeit weak)
>>> relationship between red AND crimson? Of course we can pre-seed this info,
>>> which then gets weighted by actual results.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
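
For what it's worth, the "google distance measure" idea from the thread can be tried directly on the red/crimson numbers. The sketch below is not from Mahout or Lucene; the function names (ngd, pmi) are made up for illustration, and it assumes the counts in the original question mean 100 pages with only "red", 100 with only "crimson", and 100 with both, so each term appears on 200 of the 100K pages. It computes the Normalized Google Distance and, for comparison, pointwise mutual information from plain page counts.

```python
import math


def _check(fx: int, fy: int, fxy: int, n: int) -> None:
    # Hypothetical helper: reject counts that cannot come from a real index.
    if not (0 < fx <= n and 0 < fy <= n and 0 <= fxy <= min(fx, fy)):
        raise ValueError("inconsistent page counts")
    if min(fx, fy) == n:
        raise ValueError("a term on every page carries no information")


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance; 0 means the terms always co-occur."""
    _check(fx, fy, fxy, n)
    if fxy == 0:
        return math.inf  # the terms never appear on the same page
    log_fx, log_fy = math.log(fx), math.log(fy)
    return (max(log_fx, log_fy) - math.log(fxy)) / (math.log(n) - min(log_fx, log_fy))


def pmi(fx: int, fy: int, fxy: int, n: int) -> float:
    """Pointwise mutual information in nats; above 0 means more co-occurrence than chance."""
    _check(fx, fy, fxy, n)
    if fxy == 0:
        return -math.inf
    return math.log(fxy * n / (fx * fy))


# Assumed reading of the question: 200 pages per term, 100 shared, 100K crawled.
N = 100_000
red = crimson = 200
both = 100

print(round(ngd(red, crimson, both, N), 3))  # 0.112
print(round(pmi(red, crimson, both, N), 2))  # 5.52
```

Under that reading the two words come out as fairly closely related (small distance, large positive PMI), because sharing 100 pages out of 100K is far more than chance would give for two terms that each appear on only 200 pages. The counts would come from whatever index holds the crawl; nothing here depends on a particular search library.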