I think Shashikant was using a modified form of Mahout that encoded the labels in the output.

I think we're still a little way from having a utility that makes it truly straightforward to go from text to clusterable vectors.

No doubt what is happening is the recognition that we need some kind of pipeline that can work with multiple data sources, output various consumable formats, and help select features. Unfortunately, we aren't there just yet.

-Grant

On May 29, 2009, at 11:27 AM, Benson Margulies wrote:

I'll fish for one more hint. I'm using the MAHOUT-126 code to turn text into data via TF-IDF. What comes out of there is not in the same format as your example data. Does this mean I need a different InputDriver? Is one lying about for the format written by that DocumentVector class?

On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman <[email protected]> wrote:

Benson Margulies wrote:

OK, I've got some inputs, I want to run k-means, how do I feed the beast?



Make sure you can run the Synthetic Control example to get everything wired
together correctly: JDK, Hadoop, Mahout. See
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
input job to convert your data similar to
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
and make a new job like
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
You will have a small adventure and then be operational.

Have fun,
Jeff
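
Below is a minimal sketch of what such an input job's mapper might look like, modeled loosely on the synthetic control InputDriver: it reads lines of whitespace-separated doubles and emits each row as a Mahout Vector in the text encoding the clustering jobs consume. The org.apache.mahout.matrix package, the DenseVector(double[]) constructor, the asFormatString() encoding, and the class name TextToVectorMapper are assumptions based on the 0.1-era code and may differ in your checkout.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.matrix.DenseVector;
import org.apache.mahout.matrix.Vector;

// Hypothetical converter mapper: one input line of whitespace-separated
// doubles in, one Vector (as a format string) out.
public class TextToVectorMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Parse one row of doubles.
    String[] tokens = value.toString().trim().split("\\s+");
    double[] point = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      point[i] = Double.parseDouble(tokens[i]);
    }
    // Wrap the row as a Mahout Vector and emit it in the text encoding
    // the clustering jobs expect (asFormatString is assumed here).
    Vector vector = new DenseVector(point);
    output.collect(new Text(String.valueOf(key.get())),
        new Text(vector.asFormatString()));
  }
}

Wiring this into a JobConf with plain text input/output, as the canopy InputDriver does, and then pointing a Job like the k-means one at the resulting directory is the general shape of the adventure Jeff describes.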

