Re: Clustering from DB

nfantone Thu, 02 Jul 2009 11:17:27 -0700

Thanks for the feedback, Jeff.

> The logical format of input to KMeans is <Key, Vector> as it is in sequence
> file format, but the Key is never used. To my knowledge, there is no
> requirement to assign identifiers to the input points*. Users are free to
> associate an arbitrary name field with each vector - also label mappings may
> be assigned - but these are not manipulated by KMeans or any of the other
> clustering applications. The name field is now used as a vector identifier
> by the KMeansClusterMapper - if it is non-null - in the output step only.


The keys may not be used internally, but externally they can prove
pretty useful. In my case, keys are userIDs and each Vector represents
that user's historical behavior. Being able to collect the output as
<UserID, ClusterID> pairs is quite neat: it lets me, for instance,
retrieve user information using data taken directly from a field of
an HDFS file.
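
To illustrate the shape of output described above, here is a minimal
sketch in plain Java. It deliberately uses no Mahout or Hadoop classes,
so it is not how KMeansClusterMapper works; the class name, method
names, user IDs and centroids are all hypothetical. It only shows a key
being carried through a nearest-centroid assignment so that the result
is a <UserID, ClusterID> mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Mahout.
class UserClusterSketch {

    // Squared Euclidean distance between two equal-length vectors.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearestCluster(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = squaredDistance(point, centroids[c]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // Carries each input key (the user ID) through to the output unchanged,
    // pairing it with the index of the nearest centroid.
    static Map<String, Integer> assign(Map<String, double[]> usersById,
                                       double[][] centroids) {
        Map<String, Integer> clusterByUser = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : usersById.entrySet()) {
            clusterByUser.put(e.getKey(), nearestCluster(e.getValue(), centroids));
        }
        return clusterByUser;
    }

    public static void main(String[] args) {
        // Made-up users and centroids, for illustration only.
        Map<String, double[]> users = new LinkedHashMap<>();
        users.put("user-17", new double[] {1.0, 0.5});
        users.put("user-42", new double[] {9.0, 8.5});
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };

        // One tab-separated <UserID, ClusterID> pair per line.
        for (Map.Entry<String, Integer> e : assign(users, centroids).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With the key kept alongside the cluster index, each output line can be
joined back to whatever per-user record shares that ID.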
