[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835031#comment-13835031
 ] 

Pat Ferrel commented on MAHOUT-1030:
------------------------------------

Broken record warning: The bigger issue (I agree with Grant about tackling it) 
is that each part of Mahout may or may not support NamedVectors, let alone 
WeightedPropertyVectorWritable. It is probably a big job but before 1.0 it 
would sure be nice to have something like WeightedPropertyVectorWritable 
supported optionally everywhere in Mahout. I've run across many occasions where 
it would save a lot of extra import/export code and do it in a completely 
scalable way. Import/export code is almost always non-scalable because people, 
including me, are too lazy to write the external-id to internal-id and back to 
external-id lookup in a scalable way. That said, there are a fair number of 
cases, such as matrix transpose and multiply, where using a 
WeightedPropertyVectorWritable raises issues of its own. Maybe a better way to 
solve the external-to-internal-to-external problem is a scalable implementation 
supplied as a separate tool in Mahout. If there is a feature request for this 
larger issue, maybe that is a better place for this discussion. 
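
To make the external-id to internal-id to external-id round trip concrete, here is a minimal in-memory sketch. IdDictionary is a hypothetical helper, not a Mahout class, and it is deliberately the simple, non-scalable form of the lookup: it assumes every ID fits in one JVM's memory, which is exactly the limitation a separate scalable tool would have to remove.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps external string IDs to dense
// internal int indices on import, and back to the external IDs on export.
final class IdDictionary {
  private final Map<String, Integer> toInternal = new HashMap<>();
  private final List<String> toExternal = new ArrayList<>();

  // Returns the index already assigned to externalId, or assigns the next free one.
  int internalId(String externalId) {
    Integer existing = toInternal.get(externalId);
    if (existing != null) {
      return existing;
    }
    int assigned = toExternal.size();
    toInternal.put(externalId, assigned);
    toExternal.add(externalId);
    return assigned;
  }

  // Reverse lookup used on export; throws IndexOutOfBoundsException for an unknown index.
  String externalId(int internalId) {
    return toExternal.get(internalId);
  }

  // Number of distinct external IDs seen so far.
  int size() {
    return toExternal.size();
  }
}
```

On import, each external document ID would pass through internalId() to get a vector row index; on export, externalId() recovers the original ID. Carrying the external ID with the vector itself (as a NamedVector or a property) avoids needing this dictionary at all.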

As far as this issue is concerned, it is related only to refactoring of the 
clustering code. In 0.6 the distance to centroid was stored in a 
WeightedPropertyVectorWritable. In 0.7 during refactoring, the vector type was 
changed and the distance to centroid was no longer stored with the clustered 
vectors. Restoring it would make it unnecessary to iterate through every 
clustered vector and recalculate its distance to the centroid.
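
As a rough illustration of the difference, the sketch below orders clustered points by distance in both ways. ClusteredPoint is a hypothetical stand-in for a clustered vector that carries properties, not the Mahout class; the "distance" property key and the use of Euclidean distance are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a clustered point with a property map; not a Mahout class.
final class ClusteredPoint {
  final int clusterId;
  final double[] vector;
  final Map<String, String> properties = new HashMap<>();

  ClusteredPoint(int clusterId, double[] vector) {
    this.clusterId = clusterId;
    this.vector = vector;
  }
}

final class ClusteredPointOrdering {
  // Assumed property name for the stored distance to the centroid.
  static final String DISTANCE_KEY = "distance";

  // Plain Euclidean distance; the real distance measure depends on the clustering job.
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  // Distance stored with the point: ordering needs only the points themselves.
  static List<ClusteredPoint> byStoredDistance(List<ClusteredPoint> points) {
    List<ClusteredPoint> sorted = new ArrayList<>(points);
    sorted.sort(Comparator.comparingDouble(
        p -> Double.parseDouble(p.properties.get(DISTANCE_KEY))));
    return sorted;
  }

  // Distance not stored: every centroid must be loaded first, then the
  // distance is recomputed for each point.
  static List<ClusteredPoint> byRecomputedDistance(
      List<ClusteredPoint> points, Map<Integer, double[]> centroids) {
    List<ClusteredPoint> sorted = new ArrayList<>(points);
    sorted.sort(Comparator.comparingDouble(
        p -> euclidean(p.vector, centroids.get(p.clusterId))));
    return sorted;
  }
}
```

The second method needs the full set of centroids alongside the clustered points and repeats work the clustering job has already done, which is the extra cost described above.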


> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Andrew Musselman
>             Fix For: 1.0, 0.9
>
>         Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >



--
This message was sent by Atlassian JIRA
(v6.1#6144)
