[ 
https://issues.apache.org/jira/browse/MAHOUT-552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935402#action_12935402
 ] 

Jeff Eastman commented on MAHOUT-552:
-------------------------------------

This patch will not resolve your issue, and the approach is incorrect. A cluster
center is initialized from a single input vector, but each subsequent
observation of another input vector causes it to be recomputed as the centroid
of all the vectors observed so far. Any significance in retaining the first
vector's name would be lost in that calculation. Cluster centroids are the
result of many observations, so it is not correct for them to be named.
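
To illustrate the point about centroids, here is a minimal plain-Java sketch. It
does not use Mahout's classes; CentroidSketch, centroid() and the "doc-N" names
are hypothetical stand-ins chosen for the example. It only shows that the mean
of several named vectors is a new point that belongs to none of the inputs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Mahout's AbstractCluster.
class CentroidSketch {

    // Returns the component-wise mean of all vectors in the map.
    // The keys (document names) play no part in the result.
    static double[] centroid(Map<String, double[]> namedVectors) {
        int dim = namedVectors.values().iterator().next().length;
        double[] mean = new double[dim];
        for (double[] v : namedVectors.values()) {
            for (int i = 0; i < dim; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            mean[i] /= namedVectors.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("doc-1", new double[] {1.0, 0.0}); // the vector the cluster started from
        docs.put("doc-2", new double[] {3.0, 4.0}); // a later observation
        double[] c = centroid(docs);
        // The centroid equals neither doc-1 nor doc-2, so neither name fits it.
        System.out.println(c[0] + ", " + c[1]); // prints "2.0, 2.0"
    }
}
```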

I think what you really want is to run the clustering job with the -cl option
(it is not on by default). This will compute the clusters into a clusters-n
directory and then cluster (classify) all of the input vectors into a
clusteredPoints directory. That directory will contain sequence files in which
each key is a clusterId and each value is a WeightedVectorWritable. Each value
holds a weight (1 for k-means and canopy, some value < 1 for fuzzyK and
Dirichlet) and your original input vector. If that vector was a NamedVector,
then the output vector will also be a NamedVector, preserving your documentId.
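
As a rough sketch of how that output can be used, the plain-Java example below
groups (clusterId, weight, name) records by cluster to recover which documents
landed where. The Point class and namesByCluster() are hypothetical stand-ins
for records already read from clusteredPoints; reading the actual sequence
files is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; these are not Mahout's types.
class ClusteredPointsSketch {

    // Stand-in for one clusteredPoints record: the key (clusterId) plus the
    // weight and the name carried by the original input vector.
    static final class Point {
        final int clusterId;
        final double weight;
        final String name;

        Point(int clusterId, double weight, String name) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.name = name;
        }
    }

    // Collects document names per cluster, sorted by clusterId.
    static Map<Integer, List<String>> namesByCluster(List<Point> points) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (Point p : points) {
            result.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.name);
        }
        return result;
    }

    public static void main(String[] args) {
        // Weight 1.0 mirrors the k-means/canopy case described above.
        List<Point> points = Arrays.asList(
            new Point(0, 1.0, "doc-1"),
            new Point(1, 1.0, "doc-2"),
            new Point(0, 1.0, "doc-3"));
        System.out.println(namesByCluster(points)); // prints "{0=[doc-1, doc-3], 1=[doc-2]}"
    }
}
```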

> AbstractCluster eliminates NamedVectors by replacing them with 
> RandomAccessSparseVector always
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-552
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-552
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Pere Ferrera Bertran
>             Fix For: 0.5
>
>         Attachments: MAHOUT-552.patch
>
>
> When clustering using NamedVectors as input - after running seq2sparse with 
> patch https://issues.apache.org/jira/browse/MAHOUT-401 - names are lost 
> because AbstractCluster replaces vectors coming in the constructor with 
> RandomAccessSparseVector.
