It's looking for the initial set of k clusters, which are the input to
the algorithm. Did you run Canopy to create them, or do you have another
sampling technique to initialize these clusters?
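
In case it helps, here is a rough sketch of one such sampling technique:
picking k input points uniformly at random (reservoir sampling) to use as
the initial centers. It is plain Java and deliberately not Mahout's API;
the class and method names are hypothetical, and writing the result out
as the initial clusters directory is left aside.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: picks up to k input points
// uniformly at random (reservoir sampling) as initial cluster centers.
class InitialClusterSampler {

    static List<double[]> sample(Iterable<double[]> points, int k, long seed) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Random random = new Random(seed);
        List<double[]> reservoir = new ArrayList<double[]>(k);
        int seen = 0;
        for (double[] point : points) {
            seen++;
            if (reservoir.size() < k) {
                // The first k points fill the reservoir directly.
                reservoir.add(point.clone());
            } else {
                // Each later point replaces a random slot with probability k/seen.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, point.clone());
                }
            }
        }
        // Holds fewer than k entries if the input had fewer than k points.
        return reservoir;
    }
}
```

This makes a single pass over the input, so it does not need to know the
number of points in advance.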
Jeff
nfantone wrote:
Fellows, today I updated to revision 791558 and while running kMeans I
got the following exception:
WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.
The algorithm isn't interrupted, though. But this exception wasn't
thrown before the update and, to me, its message is not quite clear.
It seems as if it's looking for any file inside a "part-00000" directory,
which doesn't exist; as far as I know, "part-xxxxx" are the default
names for output files.
I could show the entire stack trace, if needed. Any pointers?
On Thu, Jul 2, 2009 at 3:16 PM, nfantone<[email protected]> wrote:
Thanks for the feedback, Jeff.
The logical format of input to KMeans is <Key, Vector> as it is in sequence
file format, but the Key is never used. To my knowledge, there is no
requirement to assign identifiers to the input points*. Users are free to
associate an arbitrary name field with each vector - also label mappings may
be assigned - but these are not manipulated by KMeans or any of the other
clustering applications. The name field is now used as a vector identifier
by the KMeansClusterMapper - if it is non-null - in the output step only.
The keys may not be used internally, but externally they can prove to
be pretty useful. For me, keys are userIDs and each Vector represents
his/her historical behavior. Being able to collect the output
information as <UserID, ClusterID> is quite neat as it allows me to,
for instance, retrieve user information using data directly from an
HDFS file's field.
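
To make the idea concrete, here is a minimal sketch of the
<UserID, ClusterID> output I have in mind. It is plain Java rather than
Mahout code: the class name, the use of in-memory maps instead of sequence
files, and the squared Euclidean distance are all assumptions made for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout code: carries each input key (a user ID)
// through to the output next to the index of its nearest cluster center.
class UserClusterAssigner {

    // Index of the center closest to the point by squared Euclidean
    // distance, or -1 if there are no centers.
    static int nearest(double[] point, List<double[]> centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.size(); c++) {
            double[] center = centers.get(c);
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - center[d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // Produces userID -> clusterID pairs, preserving input order.
    static Map<String, Integer> assign(Map<String, double[]> users,
                                       List<double[]> centers) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, double[]> entry : users.entrySet()) {
            result.put(entry.getKey(), nearest(entry.getValue(), centers));
        }
        return result;
    }
}
```

For example, with centers (1, 1) and (9, 9), a user whose vector is (0, 0)
maps to cluster 0 and one at (10, 10) maps to cluster 1.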