[
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883103#comment-13883103
]
Pat Ferrel commented on MAHOUT-1030:
------------------------------------
Using cosine similarity for clustering, I'm getting what must be wrong values
in clusteredPoints: the distance is often > 1.
In fact, in the above output, when the job was putting distance-squared in
clusteredPoints, the results were often larger than 1 too.
With cosine, isn't it impossible to get distance: 4.92391969868745, or
distance-squared: 9.656875 for that matter?
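To make the range argument concrete, here is a minimal sketch in plain Python. It is not Mahout's code; it assumes the usual definition of cosine distance, 1 - cos(a, b), and represents sparse vectors as {index: value} dicts. The centroid values are made up for illustration, and the zero-vector convention is a choice of this sketch only. Under that definition the distance lies in [0, 1] for vectors with no negative entries (as in the pairwise-similarity input here) and in [0, 2] in general, so its square cannot exceed 4.

```python
import math

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) for sparse vectors stored as {index: value}."""
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch only: treat a zero vector as orthogonal.
        return 1.0
    # Clamp to [-1, 1] so floating-point rounding cannot push the result out of range.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return 1.0 - cos

# First three entries of the vector for key 950 quoted in this comment.
point = {39: 0.855, 43: 0.698, 72: 0.829}
# Hypothetical centroid, invented for illustration.
centroid = {39: 0.5, 72: 0.25, 1000: 0.75}

d = cosine_distance(point, centroid)
assert 0.0 <= d <= 1.0  # non-negative entries keep the distance within [0, 1]

# The general upper bound of 2 is reached only by exactly opposite vectors.
opposite = {i: -v for i, v in point.items()}
assert abs(cosine_distance(point, opposite) - 2.0) < 1e-12
```

By this definition neither 4.92391969868745 nor 9.656875 can be a cosine distance or its square.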
Using kmeans, command line arguments: {--clustering=null,
--clusters=[/Users/pat/big-data/guide/temp/cluster-seeds],
--convergenceDelta=[0.0010],
--distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
--endPhase=[2147483647],
--input=[/Users/pat/big-data/guide/temp/tmp1/pairwiseSimilarity],
--maxIter=[50], --method=[mapreduce], --numClusters=[20],
--output=[/Users/pat/big-data/guide/temp/clusters], --overwrite=null,
--startPhase=[0], --tempDir=[temp]}
Example of one point in clusteredPoints, from seqdumper:
Key: 950: Value: wt: 1.0 distance: 4.92391969868745 vec: 24 = [39:0.855,
43:0.698, 72:0.829, 260:0.829, 336:0.829, 341:0.829, 346:0.829, 363:0.807,
365:0.896, 427:0.855, 438:0.896, 787:0.855, 795:0.855, 921:0.855, 926:0.855,
932:0.807, 939:0.896, 1144:0.829, 1269:0.896, 1271:0.896, 1273:0.896,
1275:0.896, 1277:0.896, 1278:0.896, 1279:0.896, 1280:0.896, 1281:0.896,
1283:0.896, 1286:0.896, 1287:0.829, 1288:0.896, 1289:0.952, 1290:0.952,
1291:0.896, 1293:0.896, 1294:0.896, 1296:0.896, 1297:0.896, 1299:0.896,
1302:0.896, 1303:0.896, 1307:0.829, 1344:0.855, 1346:0.855, 1394:0.829,
1409:0.855, 1977:0.896, 1978:0.896, 1979:0.896, 1980:0.896, 1981:0.896,
1982:0.896]
> Regression: Clustered Points Should be WeightedPropertyVectorWritable not
> WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1030
> URL: https://issues.apache.org/jira/browse/MAHOUT-1030
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, Integration
> Affects Versions: 0.7
> Reporter: Jeff Eastman
> Assignee: Andrew Musselman
> Fix For: 0.9
>
> Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch,
> MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on
> code and tests and I don't know which properties were implemented in the old
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix
> > and I will see if I can make the change before Paritosh cuts the release
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as
> >> WeightedVectorWritable where in mahout 0.6 they were
> >> WeightedPropertyVectorWritable? This means that the distance from the
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is
> >> not a welcome change. How is one to order clustered docs by distance from
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the
> >> centroid for the cluster id given in the above WeightedVectorWritable,
> >> which means iterating through all the clusters for each clustered doc. In
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >
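The workaround described in the original report (recomputing each point's distance from its centroid) does not have to scan all clusters per document. A minimal sketch in plain Python, assuming the centroids fit in memory: index them by cluster id once, then each lookup is a single dict access. The function names, the tuple layouts, and the sample data are hypothetical, and math.dist (Euclidean, Python 3.8+) stands in for whichever distance measure the job actually used.

```python
import math

def index_centroids(clusters):
    """Build {cluster_id: centroid} once from (cluster_id, centroid) pairs."""
    return {cluster_id: centroid for cluster_id, centroid in clusters}

def rank_by_distance(points, centroids, distance):
    """Group (cluster_id, key, vector) points by cluster, nearest to centroid first."""
    ranked = {}
    for cluster_id, key, vec in points:
        d = distance(vec, centroids[cluster_id])  # O(1) centroid lookup
        ranked.setdefault(cluster_id, []).append((d, key))
    for members in ranked.values():
        members.sort()
    return ranked

# Made-up sample data: two clusters and three clustered points.
clusters = [(0, (0.0, 0.0)), (1, (10.0, 10.0))]
points = [(0, "a", (3.0, 4.0)), (0, "b", (1.0, 0.0)), (1, "c", (10.0, 12.0))]

ranked = rank_by_distance(points, index_centroids(clusters), math.dist)
print(ranked)
```

This only avoids the per-document scan; it does not replace having the distance stored in clusteredPoints, which is what this issue asks for.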