Re: Clustering from DB

Jeff Eastman Thu, 02 Jul 2009 08:35:19 -0700

See inline comments:

nfantone wrote:

After some research and testing, I believe I can shed some light on
the subject. The runJob() static method defined in KMeansDriver
expects three file paths, referencing three different files, each with
its own logical record format; moreover, a "points" directory,
along with other files, is created as part of the output:


1) input

Description: A file containing data to be clustered, represented by Vectors.
Path: An absolute path to an HDFS data file.  Example: "input/thedata.dat"
Logical format: <ID, Vector>. The ID could be anything as long as it
extends Writable.

The logical format of input to KMeans is <Key, Vector> as it is in sequence file format, but the Key is never used. To my knowledge, there is no requirement to assign identifiers to the input points*. Users are free to associate an arbitrary name field with each vector - also label mappings may be assigned - but these are not manipulated by KMeans or any of the other clustering applications. The name field is now used as a vector identifier by the KMeansClusterMapper - if it is non-null - in the output step only.

*MeanShift could certainly benefit from a requirement that all input points have unique identifiers. Using the optional name field in this manner seems pretty kludgy to me.

Code example (writing an input file):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Instantiate writer to input data in a .dat file
// with a <Text, SparseVector> logical format
String fileName = "input/thedata.dat";
Path path = new Path(fileName);

SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs,
conf, path, Text.class, SparseVector.class);
VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);

// Write Vectors to file. inputVectors could be any VectorIterable
// implementation.
writer.write(inputVectors);
writer.close();

2) clustersIn

Description: A file containing the initial pre-computed (or randomly
selected) clusters to be used by kMeans. The 'k' value is determined
by the number of clusters in THIS file.
Path: An absolute path to a DIRECTORY containing any number of files
with a "part-xxxxx" name format, where each 'x' is a single digit. The
file name should be omitted from the path. Example: "input/initial", where
'initial' has a "part-00000" file stored in it.
Logical format: <ID, ClusterBase>. The ID could be anything as long as
it extends Writable.

Again, the sequence file format requires an ID but this is not used. Each cluster has an internal ID in its state which is used by the implementation. Typically, the ID is the same as the internal ID.

Code example (writing a clustersIn file):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Instantiate writer to input clusters in a file with a
// <Text, Cluster> logical format
String fileName = "input/initial/part-00000";
Path path = new Path(fileName);

SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs,
conf, path, Text.class, Cluster.class);

// We take the first 'k' Vectors from the iterator as centers for the
// initial clusters.
// 'inputVectors' could be any VectorIterable implementation.
// CANT_INITIAL_CLUSTERS is the desired number of clusters (an integer).
// The identifier of a Cluster is used as its ID.
// AFAICT, you DO NOT need to add the center as an actual point in the cluster
// after cluster creation. This has been corrected recently.
int k = 0;
Iterator it = inputVectors.iterator();
while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
        Vector v = (Vector)it.next();
        Cluster c = new Cluster(v);
        seqClusterWriter.append(new Text(c.getIdentifier()), c);
}
seqClusterWriter.close();
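
Note that the loop above takes the first CANT_INITIAL_CLUSTERS vectors from the iterator, which is only a random selection if the input is already shuffled. If a uniform random choice of initial centers is wanted, reservoir sampling is one option that needs a single pass and no knowledge of the input size. The sketch below is plain Java with no Mahout types; the class and method names (ReservoirSample, sample) are made up for illustration, and each sampled item would still have to be wrapped in a Cluster and appended to the writer as in the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: picks k items uniformly at
// random from an Iterable of unknown length in a single pass.
final class ReservoirSample {

    static <T> List<T> sample(Iterable<T> items, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen by
                // overwriting a uniformly chosen slot.
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<String> vectors = Arrays.asList("v0", "v1", "v2", "v3", "v4", "v5");
        // Stand-in strings instead of Vectors; three of them become centers.
        System.out.println(sample(vectors, 3, new Random()));
    }
}
```

If the input holds fewer than k items, the sketch simply returns all of them, so 'k' ends up smaller than requested.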

3) output

Description: The output files generated by the algorithm, in which the
results are stored. Directories named "clusters-i" ('i' being a
positive integer) are created. I'm not quite certain, but I believe
their naming comes from the number of MAP/REDUCE tasks involved.
"part-00000" files are placed in those directories - they hold records
logically structured as <Text, Cluster>, each of which represents a
determined cluster in the dataset.

Each iteration produces a new set of clusters and these are stored in a "clusters-i" directory. The number of parts in each file is determined by the number of reducers used by the clustering implementation. Only KMeans and Dirichlet allow more than one reducer. Dirichlet and MeanShift put all these iteration-generated files in a separate state directory in the output path. The nomenclature of these directories is not standard and I see an improvement is needed.
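
Since each iteration writes its own "clusters-i" directory, code that wants the final clusters has to pick the directory with the highest i, and it has to compare i numerically because a plain string sort puts "clusters-10" before "clusters-2". A minimal sketch in plain Java follows. The class and method names (ClusterDirs, iterationOf, latest, partName) are hypothetical and not part of Mahout, the sketch works on directory names only (listing them from HDFS is left to the caller), and partName assumes Hadoop's usual five-digit, zero-padded "part-00000" convention with one part file per reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout.
final class ClusterDirs {

    private static final Pattern CLUSTERS_DIR = Pattern.compile("clusters-(\\d+)");

    // Returns i for a name of the form "clusters-i", or -1 for anything else.
    static int iterationOf(String dirName) {
        Matcher m = CLUSTERS_DIR.matcher(dirName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Returns the "clusters-i" name with the highest i, or null if none match.
    static String latest(List<String> dirNames) {
        String best = null;
        int bestIteration = -1;
        for (String name : dirNames) {
            int iteration = iterationOf(name);
            if (iteration > bestIteration) {
                bestIteration = iteration;
                best = name;
            }
        }
        return best;
    }

    // Assumed naming of the part file written by a given reducer (0-based).
    static String partName(int reducer) {
        return String.format("part-%05d", reducer);
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("clusters-2", "points", "clusters-10");
        System.out.println(latest(children) + "/" + partName(0));
    }
}
```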

Path: An absolute path to a parent directory for the "clusters-i"
directories. Example: "output".
Code example (reading and printing an output file):

// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Create a reader for a 'part-00000' file
Path outPath = new Path("output/clusters-0/part-00000");
SequenceFile.Reader reader  = new SequenceFile.Reader(fs, outPath, conf);

Writable key =  (Writable) reader.getKeyClass().newInstance();
Cluster value = new Cluster();
Vector center = null;

// Read file's records and print each cluster as 'Cluster: key {center}'
while (reader.next(key, value)) {
        System.out.println("Cluster: " + key + " { ");
        center = value.getCenter();

        for (int i = 0; i < center.size(); i++) {
                System.out.print(center.get(i) + " ");
        }
        System.out.println(" }");
}
reader.close();

4) points

Description: A directory containing a "part-00000" file of
<VectorID, ClusterID> records (both being Text type fields). It's basically an
index (with VectorID as key) that matches every Vector described in
the input ("thedata.dat" in our example) with the cluster it now
belongs to.
Logical format: <VectorID, ClusterID>. VectorID matches the ID
specified by the first field of each record in the input file.
ClusterID matches the ID in the first field of each "part-xxxxx"
included in a "clusters-i" directory.

The output points format has been recently changed from <ClusterID, Vector-asFormatString> to output either <Vector.name, ClusterID> or <Vector.asFormatString, ClusterID>, depending upon whether the points have been named or not.

The "TODO: This is ugly" comment in the Cluster code used for this kludge is spot on.
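
To make the key selection for the points output concrete, here is a minimal sketch of the rule described above: use the vector's name when it has one, otherwise fall back to its formatted string. This is plain Java and not the actual Cluster or KMeansClusterMapper code; the class and method names (PointKeys, keyFor, record), the tab separator, and the sample strings are all made up for illustration.

```java
// Hypothetical illustration of the points output key rule, not Mahout code.
final class PointKeys {

    // Named points are keyed by name, unnamed ones by their formatted string.
    static String keyFor(String name, String formatString) {
        return name != null ? name : formatString;
    }

    // One <key, ClusterID> record rendered as text; the tab is an assumption.
    static String record(String name, String formatString, String clusterId) {
        return keyFor(name, formatString) + "\t" + clusterId;
    }

    public static void main(String[] args) {
        System.out.println(record("doc-17", "[1.0, 2.0]", "C3"));
        System.out.println(record(null, "[1.0, 2.0]", "C3"));
    }
}
```

A consumer of the points file therefore cannot assume the key is a short identifier unless every input vector was given a name.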

Jeff
