See inline comments:
nfantone wrote:
After some research and testing, I believe I can shed some light on
the subject. The runJob() static method defined in KMeansDriver
expects three file paths, referencing three different files with
different logical record formats; moreover, a "points" directory,
along with other files, is created as part of the output:
1) input
Description: A file containing data to be clustered, represented by Vectors.
Path: An absolute path to an HDFS data file. Example: "input/thedata.dat"
Logical format: <ID, Vector>. The ID could be anything as long as it
implements Writable.
The logical format of input to KMeans is <Key, Vector> as it is in
sequence file format, but the Key is never used. To my knowledge, there
is no requirement to assign identifiers to the input points*. Users are
free to associate an arbitrary name field with each vector - also label
mappings may be assigned - but these are not manipulated by KMeans or
any of the other clustering applications. The name field is now used as
a vector identifier by the KMeansClusterMapper - if it is non-null - in
the output step only.
*MeanShift could certainly benefit from a requirement that all input
points have unique identifiers. Using the optional name field in this
manner seems pretty kludgy to me.
Code example (writing an input file):
// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Instantiate writer to input data in a .dat file
// with a <Text, SparseVector> logical format
String fileName = "input/thedata.dat";
Path path = new Path(fileName);
SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs,
conf, path, Text.class, SparseVector.class);
VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);
// Write Vectors to file. inputVectors could be any VectorIterable
// implementation.
writer.write(inputVectors);
writer.close();
2) clustersIn
Description: A file containing the initial pre-computed (or randomly
selected) clusters to be used by kMeans. The 'k' value is determined
by the number of clusters in THIS file.
Path: An absolute path to a DIRECTORY containing any number of files
with a "part-xxxxx" name format, where each 'x' is a single digit. The
file name should be omitted from the path. Example: "input/initial", where
'initial' has a "part-00000" file stored in it.
Logical format: <ID, ClusterBase>. The ID could be anything as long as
it implements Writable.
Again, the sequence file format requires an ID, but it is not used.
Each cluster has an internal ID in its state, which is what the
implementation uses. Typically, the sequence file ID is the same as the
internal ID.
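To illustrate the "part-xxxxx" naming convention mentioned above, here is a minimal sketch that filters a list of file names down to those that look like part files. The class and method names are hypothetical, and the five-digit pattern is only an assumption taken from the "part-00000" example; this is plain Java and does not touch HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout or Hadoop.
class PartFiles {
    // Assumption: part files are named "part-" followed by exactly five digits.
    private static final Pattern PART = Pattern.compile("part-\\d{5}");

    // True if the bare file name follows the assumed "part-xxxxx" pattern.
    static boolean isPartFile(String name) {
        return PART.matcher(name).matches();
    }

    // Keeps only the names that look like part files, preserving order.
    static List<String> partFiles(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (isPartFile(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Something like PartFiles.partFiles(namesInDirectory) could then be used to decide which files in "input/initial" to read as initial clusters.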
Code example (writing a clustersIn file):
// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Instantiate writer to input clusters in a file with a
// <Text, Cluster> logical format
String fileName = "input/initial/part-00000";
Path path = new Path(fileName);
SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs,
conf, path, Text.class, Cluster.class);
// We choose 'k' random Vectors as centers for the initial clusters.
// 'inputVectors' could be any VectorIterable implementation.
// CANT_INITIAL_CLUSTERS is a desired integer value.
// The identifier of a Cluster is used as its ID.
// AFAICT, you DO NOT need to add the center as an actual point in the cluster,
// after cluster creation. This has been corrected recently.
int k = 0;
Iterator it = inputVectors.iterator();
while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
    Vector v = (Vector) it.next();
    Cluster c = new Cluster(v);
    seqClusterWriter.append(new Text(c.getIdentifier()), c);
}
seqClusterWriter.close();
3) output
Description: The output files generated by the algorithm, in which the
results are stored. Directories named "clusters-i" (with 'i' being a
positive integer) are created. I'm not quite certain, but I believe
their naming comes from the number of MAP/REDUCE tasks involved.
"part-00000" files are placed in those directories. They hold records
logically structured as <Text, Cluster>, each of which represents a
given cluster in the dataset.
Each iteration produces a new set of clusters, and these are stored in a
"clusters-i" directory. The number of part files in each directory is
determined by the number of reducers used by the clustering
implementation. Only KMeans and Dirichlet allow more than one reducer.
Dirichlet and MeanShift put all these iteration-generated files in a
separate state directory in the output path. The naming of these
directories is not standardized, and I can see that an improvement is
needed.
Path: An absolute path to a parent directory for the "clusters-i"
directories. Example: "output".
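Since each iteration writes its own "clusters-i" directory, someone interested in the final clusters would presumably want the directory with the highest 'i'. Here is a minimal sketch of picking it from a list of directory names. The class and method names are hypothetical, and it works on plain strings rather than on an actual HDFS listing.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout or Hadoop.
class ClusterDirs {
    private static final Pattern CLUSTERS = Pattern.compile("clusters-(\\d+)");

    // Returns the iteration number encoded in a "clusters-i" name,
    // or -1 if the name does not follow that pattern.
    static int iteration(String dirName) {
        Matcher m = CLUSTERS.matcher(dirName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Returns the name with the highest iteration number,
    // or null if no name follows the "clusters-i" pattern.
    static String latest(List<String> dirNames) {
        String best = null;
        int bestIteration = -1;
        for (String name : dirNames) {
            int i = iteration(name);
            if (i > bestIteration) {
                bestIteration = i;
                best = name;
            }
        }
        return best;
    }
}
```

Comparing the parsed integers rather than the raw names matters once there are ten or more iterations, because "clusters-10" sorts before "clusters-2" as a string.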
Code example (reading and printing an output file):
// Get FileSystem through Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Create a reader for a 'part-00000' file
Path outPath = new Path("output/clusters-0/part-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPath, conf);
Writable key = (Writable) reader.getKeyClass().newInstance();
Cluster value = new Cluster();
Vector center = null;
// Read file's records and print each cluster as 'Cluster: key {center}'
while (reader.next(key, value)) {
    System.out.print("Cluster: " + key + " { ");
    center = value.getCenter();
    for (int i = 0; i < center.size(); i++) {
        System.out.print(center.get(i) + " ");
    }
    System.out.println("}");
}
reader.close();
4) points
Description: A directory containing a "part-00000" file with
<VectorID, ClusterID> records (both being Text type fields). It's
basically an index (with VectorID as key) that matches every Vector
described in the input ("thedata.dat" in our example) with the cluster
to which it now belongs.
Logical format: <VectorID, ClusterID>. VectorID matches the ID
specified by the first field of each record in the input file.
ClusterID matches the ID in the first field of each record in the
"part-xxxxx" files included in a "clusters-i" directory.
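Because "points" is an index keyed by VectorID, listing the members of each cluster means inverting it. A minimal sketch of that inversion follows, assuming the <VectorID, ClusterID> records have already been read into a map of strings. The class and method names are hypothetical, and no Hadoop or Mahout classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout or Hadoop.
class PointsIndex {
    // Inverts a VectorID -> ClusterID index into
    // ClusterID -> list of VectorIDs, with cluster IDs in sorted order.
    static Map<String, List<String>> membersByCluster(Map<String, String> vectorToCluster) {
        Map<String, List<String>> members = new TreeMap<>();
        for (Map.Entry<String, String> entry : vectorToCluster.entrySet()) {
            members.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                   .add(entry.getKey());
        }
        return members;
    }
}
```

The order of the VectorIDs inside each list follows the iteration order of the input map, so a sorted map gives sorted member lists.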
The output points format has recently been changed from <ClusterID,
Vector.asFormatString> to output either
<Vector.name, ClusterID> or <Vector.asFormatString, ClusterID>, depending
on whether the points have been named.
The "TODO: This is ugly" comment in the Cluster code used for this
kludge is spot on.
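The key selection rule described above can be sketched as follows. This is only an illustration of the rule as stated here, not the actual Cluster code. The class and method names are hypothetical, and the two arguments stand in for the vector's optional name and its asFormatString rendering.

```java
// Hypothetical helper, not part of Mahout.
class PointKeys {
    // name: the optional vector name, null if the point was never named.
    // formatString: the vector rendered as a format string.
    // Returns the name when present, otherwise the format string.
    static String outputKey(String name, String formatString) {
        return name != null ? name : formatString;
    }
}
```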
Jeff