Re: Clustering from DB

Grant Ingersoll Fri, 26 Jun 2009 08:42:17 -0700


On Jun 26, 2009, at 10:20 AM, nfantone wrote:

Hi to you all, Mahout users. I'm new to the list and to Mahout itself
and I'm trying to integrate Taste to my project in which I need to
cluster user data from a very large data set, based on their behavior
which is stored in some tables in a local data base. From what I've
read and experimented, clustering in Mahout takes advantage of HDFS
and Lucene indexing, converting plain CSV files to Vectors. So, I ask:
is it mandatory to create plain text files (or HDFS files) and indexes
from the data in my DB so as to feed clustering algorithm's input?
Couldn't I create, somehow, the Vectors directly and then use them to
initiate the clustering jobs? Is there any convenient way to achieve
this? I've not seen anything similar to the "DataModel" interface used
by Recommenders for JDBC connection (or any other connectivity API)
and the runJob static methods receive paths for both input and output
which, a priori, I don't have any use for. Documentation wasn't
helpful either as the "From a Database" section of "Creating Vectors
from Text" is currently empty.

The clustering algorithms (on trunk) expect the input file to be a Hadoop SequenceFile of <Writable, Vector>.

The utils module contains an interface named VectorIterable which could easily be implemented to work with a JDBC connection. There is an implementation of this for Lucene (LuceneIterable). However, it is likely just as easy to write your own ResultSet loop that reads from your DB and outputs the SequenceFile. There are SequenceFile.Writer examples in several places in the utils module. See the Driver class in the utils module for an example.
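To make the ResultSet loop idea concrete, here is a rough sketch using only the JDK. The class name DbVectorExport, the VectorSink interface, and the column layout (first column is the row key, all remaining columns are numeric features) are assumptions made up for illustration, not anything in Mahout. In a real job, the sink would wrap a Hadoop SequenceFile.Writer and append a Writable key plus a Mahout Vector built from the double[]; check the Driver class for how the writer is actually set up.

```java
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;

final class DbVectorExport {

    /**
     * Hypothetical stand-in for the place where a real implementation
     * would append a (key, Vector) pair to a SequenceFile.Writer.
     */
    interface VectorSink {
        void append(String key, double[] vector) throws IOException;
    }

    /**
     * Walks the ResultSet and hands one dense vector per row to the sink.
     * Assumes column 1 is the row key and columns 2..n are numeric features.
     * Returns the number of rows written.
     */
    static int export(ResultSet rs, VectorSink sink) throws SQLException, IOException {
        int columns = rs.getMetaData().getColumnCount();
        int rows = 0;
        while (rs.next()) {
            String key = rs.getString(1);
            double[] vector = new double[columns - 1];
            for (int c = 2; c <= columns; c++) {
                // JDBC columns are 1-based; the vector is 0-based.
                vector[c - 2] = rs.getDouble(c);
            }
            sink.append(key, vector);
            rows++;
        }
        return rows;
    }
}
```

Keeping the sink behind a small interface means the loop can be tried against an in-memory list first and pointed at the SequenceFile once the query and column layout look right.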

Also, FYI, Taste is separate from what you seem to be implying you want to do. Taste is a collaborative filtering engine that lives in Mahout. Mahout also has several clustering implementations like k-Means, Canopy, Dirichlet, etc.
