I think Shashikant was using a modified form of Mahout that encoded
the labels in the output.
I think we're still a little bit away from having a utility that truly makes it straightforward to go from text to clusterable vectors.
No doubt what is happening is the recognition that we need some kind of pipeline process that can work with multiple data sources, output various consumable formats, and help with feature selection. Unfortunately, we aren't there just yet.
-Grant
On May 29, 2009, at 11:27 AM, Benson Margulies wrote:
I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text into data via TF-IDF. What comes out of there is not in the same format as your example data. Does this mean that I need a different InputDriver? Is one lying about for the format written by that DocumentVector class?
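For readers unfamiliar with the weighting Benson mentions, here is a minimal sketch of the standard TF-IDF formula itself. This is not the MAHOUT-126 code; the class and method names are hypothetical, and the actual Mahout implementation emits its results in its own vector format.

```java
// Hypothetical illustration of TF-IDF weighting (not the MAHOUT-126 code).
public class TfIdfSketch {

    // Classic tf-idf: term frequency scaled by the log of
    // (total documents / documents containing the term).
    static double tfIdf(int termFreq, int numDocs, int docFreq) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in one document, present in 2 of 10 documents.
        System.out.println(tfIdf(3, 10, 2));
    }
}
```

Each document then becomes a vector of such weights, one entry per term, which is the kind of output that then needs converting into whatever format the clustering jobs consume.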
On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman <[email protected]> wrote:
Benson Margulies wrote:
OK, I've got some inputs, I want to run k-means, how do I feed the beast?
Make sure you can run the Synthetic Control example to get everything wired together correctly: JDK, Hadoop, Mahout. See http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an input job to convert your data, similar to /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java, and make a new job like /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
You will have a small adventure and then be operational.
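The conversion step Jeff describes boils down to parsing each input record into a numeric vector the clustering jobs can read. A minimal sketch of that parsing, assuming whitespace-delimited numbers like the synthetic control data (the class name is hypothetical, and the real InputDriver additionally writes the vectors out through Hadoop rather than returning arrays):

```java
import java.util.Arrays;

// Hypothetical sketch of the parsing an InputDriver-style job performs:
// turning one whitespace-delimited line of numbers into a dense vector.
public class InputParseSketch {

    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] vector = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vector[i] = Double.parseDouble(tokens[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Illustrative values only, in the style of the synthetic control rows.
        double[] v = parseLine("28.78 24.89 31.01");
        System.out.println(Arrays.toString(v));
    }
}
```

Benson's problem in the thread is precisely that TF-IDF output is not in this simple delimited form, hence the need for a different converter.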
Have fun,
Jeff