On Jun 18, 2009, at 11:18 AM, Paul Jones wrote:
Okay I have brain freeze, reading the email below:-)
I think PLSI will do (or is a great starter) for what I want. I am
looking at a Hadoop install with Mahout on top; is there any need
for Lucene?
I haven't looked at the PLSI Pig thing yet, but I've been using the
Lucene stuff to produce Vectors from a Lucene index. So, if you
already have your own Vectors/Matrix, then no need for Lucene.
Also, is there a "dummies" guide to all these algos, i.e. which are
clustering algos, which are indexing, which are for "abc"? I
am reading a ton of information and am not 100% sure which
categories they all fit into. Hope the question is not too vague.
The Wiki is the place to start. I've been working on
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData,
but it's far from complete. That will cover the clustering stuff.
As for indexing, I'm not sure what you mean. If you mean indexing
as in Lucene, there is no code for that in Mahout.
FWIW, in answer to your original question, I've seen some people do
some interesting stuff with Graph Theory (ranking, etc.) and
relationships between words.
-Grant