On Jun 26, 2009, at 10:20 AM, nfantone wrote:

Hi to you all, Mahout users. I'm new to the list and to Mahout itself
and I'm trying to integrate Taste into my project, in which I need to
cluster user data from a very large data set, based on their behavior
which is stored in some tables in a local database. From what I've
read and experimented, clustering in Mahout takes advantage of HDFS
and Lucene indexing, converting plain CSV files to Vectors. So, I ask:
is it mandatory to create plain text files (or HDFS files) and indexes
from the data in my DB so as to feed the clustering algorithms' input?
Couldn't I create, somehow, the Vectors directly and then use them to
initiate the clustering jobs? Is there any convenient way to achieve
this? I've not seen anything similar to the "DataModel" interface used
by Recommenders for JDBC connection (or any other connectivity API)
and the runJob static methods receive paths for both input and output
which, a priori, I don't have any use for. Documentation wasn't
helpful either as the "From a Database" section of "Creating Vectors
from Text" is currently empty.

The clustering algorithms (on trunk) expect the input file to be a Hadoop SequenceFile of <Writable, Vector>.

The utils module contains an interface named VectorIterable, which could easily be implemented to work with a JDBC connection. There is an implementation of this for Lucene (LuceneIterable). However, it is likely just as easy to write your own ResultSet loop that reads from your DB and writes the SequenceFile. There are SequenceFile.Writer examples in several places in the utils module; see the Driver class in the utils module for an example.
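
To make the "ResultSet loop" idea concrete, here is a rough sketch of what I mean. It is only an illustration under some assumptions: the class name ResultSetVectors, the VectorSink interface, and the column layout (column 1 is a numeric row id, the next numFeatures columns are numeric features) are all made up for this example and are not part of Mahout. The SequenceFile write is deliberately hidden behind the sink, because the exact Vector and key classes depend on the Mahout and Hadoop versions you are on.

```java
import java.sql.ResultSet;

/** Sketch: drain a JDBC ResultSet into (key, dense values) pairs. */
final class ResultSetVectors {

    /**
     * Hypothetical sink, not a Mahout interface. A real implementation
     * would wrap the key and values in Writable types and hand them to
     * SequenceFile.Writer.append(key, value).
     */
    interface VectorSink {
        void write(long key, double[] values) throws Exception;
    }

    /**
     * Assumes column 1 is a numeric row id and columns 2..(1 + numFeatures)
     * hold numeric features. Returns the number of rows passed to the sink.
     */
    static int copy(ResultSet rs, int numFeatures, VectorSink sink) throws Exception {
        int rows = 0;
        while (rs.next()) {
            long key = rs.getLong(1);
            double[] values = new double[numFeatures];
            for (int i = 0; i < numFeatures; i++) {
                values[i] = rs.getDouble(i + 2); // JDBC columns are 1-based
            }
            sink.write(key, values);
            rows++;
        }
        return rows;
    }
}
```

You would call copy(...) with the ResultSet from your own query and a sink that writes each row to the SequenceFile. Keeping the sink separate also lets you check the DB-to-vector mapping with an in-memory sink before involving Hadoop at all.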

Also, FYI, Taste is separate from what you seem to be implying you want to do. Taste is a collaborative filtering engine that lives in Mahout. Mahout also has several clustering implementations, such as k-Means, Canopy, Dirichlet, etc.
