Thanks for the feedback, Jeff.

> The logical format of input to KMeans is <Key, Vector> as it is in sequence
> file format, but the Key is never used. To my knowledge, there is no
> requirement to assign identifiers to the input points*. Users are free to
> associate an arbitrary name field with each vector - also label mappings may
> be assigned - but these are not manipulated by KMeans or any of the other
> clustering applications. The name field is now used as a vector identifier
> by the KMeansClusterMapper - if it is non-null - in the output step only.

The keys may not be used internally, but externally they can prove pretty useful. In my case, the keys are user IDs and each Vector represents that user's historical behavior. Being able to collect the output as <UserID, ClusterID> pairs is quite neat because it lets me, for instance, retrieve user information using a field taken directly from an HDFS file.
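
To make the use case concrete, here is a minimal sketch in plain Java of what I mean by carrying the key through to the output. It is not Mahout code and does not use KMeansClusterMapper: the class KeyedAssignment, its methods, the sample user IDs and the centroids are all made up for illustration, and it only covers the final assignment step, assuming the centroids have already been computed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Mahout.
public class KeyedAssignment {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearestCluster(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Carries each input key (here a user ID) through to the output,
    // producing <UserID, ClusterID> pairs.
    static Map<String, Integer> assign(Map<String, double[]> points, double[][] centroids) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : points.entrySet()) {
            out.put(e.getKey(), nearestCluster(e.getValue(), centroids));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample data: user ID -> behavior vector.
        Map<String, double[]> points = new LinkedHashMap<>();
        points.put("user-1", new double[] {0.1, 0.2});
        points.put("user-2", new double[] {9.8, 10.1});
        points.put("user-3", new double[] {0.3, 0.0});

        // Assumed centroids from an earlier clustering run.
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };

        // One tab-separated <UserID, ClusterID> line per input point.
        assign(points, centroids)
            .forEach((user, cluster) -> System.out.println(user + "\t" + cluster));
    }
}
```

With output keyed this way, each line can be joined against the user records by ID without any extra bookkeeping.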
