[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881226#comment-13881226
 ] 

Pat Ferrel commented on MAHOUT-1030:
------------------------------------

This fixes a very literal reading of the bug. The distance-squared is indeed 
included in clusteredPoints, but there are no vector ids, so the distance can't 
actually be used. Without a vector id in clusteredPoints, Mahout doesn't really 
perform unsupervised categorization. I will now have to loop through all 
vectors, recalculate the distance, and categorize each one according to the 
cluster centroid it is closest to. 
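
To make that workaround concrete, here is a minimal sketch of the loop. It is 
not Mahout code: it assumes the vectors and centroids have already been read 
into plain Python dicts keyed by id, that the vectors are sparse maps of 
index to value, and that squared Euclidean distance is the measure in use. The 
names squared_distance and assign_to_nearest are hypothetical.

```python
from typing import Dict, Tuple

# Assumption: a sparse vector is a dict of {dimension index: value}.
SparseVector = Dict[int, float]


def squared_distance(a: SparseVector, b: SparseVector) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def assign_to_nearest(
    vectors: Dict[int, SparseVector],
    centroids: Dict[int, SparseVector],
) -> Dict[int, Tuple[int, float]]:
    """Map each vector id to (closest cluster id, squared distance).

    This compares every vector against every centroid, so the cost is
    proportional to (number of vectors) * (number of clusters).
    """
    assignments: Dict[int, Tuple[int, float]] = {}
    for vector_id, vec in vectors.items():
        best_cluster = None
        best_dist = float("inf")
        for cluster_id, centroid in centroids.items():
            dist = squared_distance(vec, centroid)
            if dist < best_dist:
                best_cluster, best_dist = cluster_id, dist
        assignments[vector_id] = (best_cluster, best_dist)
    return assignments
```

The result keeps the vector id next to the cluster id and distance, which is 
exactly the association that is missing from clusteredPoints.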

The clusteredPoints and distance-squared can't actually be used without knowing 
the vector id. I think named vectors work here, but many cases, including mine, 
do not have names, only Mahout integer ids.

Please correct me if I've missed something.

When I cluster the user preference data used in the Mahout recommender, I get 
clusteredPoints something like this. The data from the vector is given, but not 
its id. The key here is a cluster id.

pat$ mahout seqdumper -i /Users/pat/big-data/temp/clusters/clusteredPoints/ | more
Jan 24, 2014 10:02:05 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--endPhase=[2147483647], 
--input=[/Users/pat/big-data/temp/clusters/clusteredPoints/], --startPhase=[0], 
--tempDir=[temp]}
2014-01-24 10:02:05.707 java[29221:1003] Unable to load realm info from 
SCDynamicStore
Input Path: file:/Users/pat/big-data/temp/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable
Key: 39: Value: wt: 1.0 distance-squared: 9.656875  vec: [0:1.000, 2:1.000, 
5:1.000, 9:1.000, 12:1.000, 13:1.000, 17:1.000, 18:1.000, 19:1.000, 20:1.000]
Key: 48: Value: wt: 1.0 distance-squared: 22.229166666666686  vec: [25:1.000, 
26:1.000, 27:1.000, 28:1.000, 29:1.000, 30:1.000, 31:1.000, 36:1.000, 38:1.000, 
39:1.000, 40:1.000, 41:1.000, 43:1.000, 44:1.000, 46:1.000, 48:1.000, 53:1.000, 
54:1.000, 55:1.000, 56:1.000, 57:1.000, 58:1.000, 60:1.000, 63:1.000, 64:1.000, 
66:1.000, 67:1.000, 68:1.000, 69:1.000, 70:1.000, 71:1.000, 72:1.000]
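
As a rough illustration of what each record carries, the sketch below parses 
dump text in the format shown above. It assumes the sparse "index:value" 
vector layout seen in this output and tolerates the line wrapping added by the 
mail client; parse_records is a hypothetical helper, not part of Mahout.

```python
import re

# Matches one record of the dump format shown above. The vector body may be
# wrapped across several lines, hence re.DOTALL.
RECORD = re.compile(
    r"Key: (\d+): Value: wt: (\S+) distance-squared: (\S+)\s+vec: \[(.*?)\]",
    re.DOTALL,
)


def parse_records(text):
    """Return a list of dicts, one per clusteredPoints record in the text."""
    records = []
    for match in RECORD.finditer(text):
        cluster_id, weight, dist_sq, body = match.groups()
        vec = {}
        for entry in body.split(","):
            entry = entry.strip()
            if entry:
                index, value = entry.split(":")
                vec[int(index)] = float(value)
        records.append(
            {
                "cluster_id": int(cluster_id),
                "weight": float(weight),
                "distance_squared": float(dist_sq),
                "vector": vec,
            }
        )
    return records
```

Every field recovered this way describes the cluster or the vector contents; 
none of them identifies which input vector the record came from.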


> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Andrew Musselman
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, 
> MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
