[ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881283#comment-13881283 ]

Pat Ferrel commented on MAHOUT-1030:
------------------------------------

Hmm, Suneel recommends creating a new JIRA, so I will.

Comments from Suneel:

I concur that we are presently not capturing the vector IDs (unless it's a 
NamedVector) and also concur that it's hard to infer which vector belongs to 
which cluster without them. For now it seems easiest to use NamedVector to 
determine which vectors belong to a cluster.

The clustering algorithms read only the VectorWritable, and if the 
VectorWritable does not carry the vector key (i.e. is not a NamedVector), the 
clustering algorithm simply doesn't have it.

See the following code snippet from PartialVectorMergeReducer.java:

{Code}

    if (namedVector) {
      vector = new NamedVector(vector, key.toString());
    }

    // drop empty vectors.
    if (vector.getNumNondefaultElements() > 0) {
      VectorWritable vectorWritable = new VectorWritable(vector);
      context.write(key, vectorWritable);
    }

{Code}

So, per the above code snippet, if it's not a NamedVector then the 
corresponding vector key is not captured in the VectorWritable.
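
To make the consequence concrete, here is a minimal sketch of what a consumer 
that sees only the value can recover. It uses plain Java stand-ins rather than 
the Mahout classes: Vec, NamedVec, toValue and idOf are hypothetical names 
invented for illustration, and only the conditional wrap mirrors the snippet 
above.

```java
import java.util.Optional;

// Stand-in types for illustration only; these are NOT the Mahout classes.
class NamedVectorSketch {

    // Plays the role of a plain vector: values, no identity.
    static class Vec {
        final double[] values;
        Vec(double[] values) { this.values = values; }
    }

    // Plays the role of a named vector: the same values plus the key as a name.
    static class NamedVec extends Vec {
        final String name;
        NamedVec(double[] values, String name) {
            super(values);
            this.name = name;
        }
    }

    // Mirrors the branch in the quoted reducer: wrap only when the flag is set.
    static Vec toValue(String key, double[] values, boolean namedVector) {
        return namedVector ? new NamedVec(values, key) : new Vec(values);
    }

    // What a reader that never sees the key can learn about the vector's id.
    static Optional<String> idOf(Vec value) {
        if (value instanceof NamedVec) {
            return Optional.of(((NamedVec) value).name);
        }
        return Optional.empty();
    }
}
```

With the flag off, idOf has nothing to return, which is the situation the 
clustering code ends up in when it reads only the values.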

The RowIdJob reads the same Tf-Idf vectors and creates a docIndex and matrix (I 
am sure you know their layout and what they are intended for, so I'll skip the 
details here).

The following code snippet from ClusterIterator.iterateSeq() only reads the 
VectorWritable but not the Key:

      for (VectorWritable vw : new SequenceFileDirValueIterable<VectorWritable>(inPath, PathType.LIST,
          PathFilters.logsCRCFilter(), conf)) {

It should have been reading a Pair<Text, VectorWritable> to capture the Key for 
the vector as well.
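
The difference between the two ways of reading can be sketched without Hadoop 
or Mahout on the classpath. In the sketch below a LinkedHashMap stands in for 
the sequence file (document key -> vector). The names KeyedIterationSketch, 
valuesOnly, keyedPairs, assignByKey and nearest are hypothetical, and the 
nearest-centroid step (squared Euclidean distance) is a simplification, not 
the actual ClusterIterator logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch only: a map plays the role of the sequence file.
// None of these names are Mahout or Hadoop API.
class KeyedIterationSketch {

    // Mimics a value-only iterator: the caller sees vectors, never keys.
    static List<double[]> valuesOnly(Map<String, double[]> file) {
        return new ArrayList<>(file.values());
    }

    // Mimics a key/value iterator: each vector arrives together with its key.
    static List<Map.Entry<String, double[]>> keyedPairs(Map<String, double[]> file) {
        return new ArrayList<>(file.entrySet());
    }

    // With keys available, the result can say which document went to which cluster.
    static Map<String, Integer> assignByKey(Map<String, double[]> file, double[][] centroids) {
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> pair : keyedPairs(file)) {
            assignment.put(pair.getKey(), nearest(pair.getValue(), centroids));
        }
        return assignment;
    }

    // Index of the closest centroid by squared Euclidean distance.
    static int nearest(double[] v, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < v.length; i++) {
                double diff = v[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

Iterating with valuesOnly yields the same vectors but gives no way to report 
the document key alongside the cluster index, which is the gap described above.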

I presently have a 0.9 Release sitting out there in staging waiting to be 
finalized.  Please create a JIRA for this and we should have it fixed in the 
next major release (or Release Candidate).



> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Andrew Musselman
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, 
> MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
