[ 
https://issues.apache.org/jira/browse/MAHOUT-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881974#comment-13881974
 ] 

Pat Ferrel commented on MAHOUT-1410:
------------------------------------

Looks like you changed the distance-squared to distance, which is fine by me 
since it's labeled correctly.

I get a failure in the Mahout build tests, but I generally do, so I've been 
building without tests. This has been happening since 0.8, so it is probably a 
problem with my own setup and not a real test failure. Some dump info is listed 
below.

When I build without tests the patch works perfectly on kmeans and fuzzykmeans. 
These were run using the drivers, so with mapreduce. I checked the input to see 
whether the vectors match the ones in clusteredPoints, and indeed they do. So 
I'd call that confirmation. 

Questions I'd ask in a review:
1) Are you changing the type of the vector in clusteredPoints to NamedVector or 
was it always that way with a blank name?
2) If the class is changing are we sure that doesn't mess things up when 
actually using a named vector as clustering input?
3) Not sure how to test "sequential" clustering but if you tell me what you 
mean I can test that too.

This is fixed for me.

pat$ mahout seqdumper -i ~/big-data/guide/temp/tmp1/pairwiseSimilarity/ | more
Input Path: file:/Users/pat/big-data/guide/temp/tmp1/pairwiseSimilarity/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {5:0.854718643737767,172:0.8292121703264371,192:0.854718643737767,13:0.8962379566075429,198:0.8962379566075429,20:0.8962379566075429,9:0.8962379566075429,19:0.8962379566075429,201:0.8962379566075429,207:0.8962379566075429,43:0.6976395601234899,193:0.8962379566075429,18:0.8962379566075429,184:0.8962379566075429,209:0.8962379566075429,187:0.854718643737767,2:0.8962379566075429,211:0.854718643737767,27:0.7327087555023397,177:0.854718643737767,183:0.8292121703264371,214:0.8962379566075429,17:0.854718643737767,190:0.854718643737767,176:0.8292121703264371,12:0.8962379566075429,191:0.854718643737767}

The above is what I'm clustering (the input), and below is clusteredPoints 
(the output):

pat$ mahout seqdumper -i /Users/pat/big-data/guide/temp/clusters/clusteredPoints/part-m-00000  | more
Key: 593: Value: wt: 1.0 distance: 4.487859968341614  vec: 0 = [2:0.896, 
5:0.855, 9:0.896, 12:0.896, 13:0.896, 17:0.855, 18:0.896, 19:0.896, 20:0.896, 
27:0.733, 43:0.698, 172:0.829, 176:0.829, 177:0.855, 183:0.829, 184:0.896, 
187:0.855, 190:0.855, 191:0.855, 192:0.855, 193:0.896, 198:0.896, 201:0.896, 
207:0.896, 209:0.896, 211:0.855, 214:0.896]
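
The comparison above was done by eye. A minimal sketch of how it could be 
automated follows, assuming the seqdumper text is pasted in as strings and that 
the clusteredPoints dump shows values rounded to three decimals (that is how 
the output above looks, not something I checked in the code). The function 
names are made up for illustration and only a few of the 27 entries are used.

```python
import re

def parse_sparse_vector(text):
    """Pull 'index:value' pairs out of seqdumper-style vector text."""
    pairs = re.findall(r"(\d+):([0-9.eE+-]+)", text)
    return {int(index): float(value) for index, value in pairs}

def vectors_match(full, rounded, tol=5e-4):
    """True if both vectors have the same indices and every value agrees
    to within tol (half a unit in the third decimal, by assumption)."""
    if full.keys() != rounded.keys():
        return False
    return all(abs(full[i] - rounded[i]) <= tol + 1e-12 for i in full)

# A few entries copied from the input dump (pairwiseSimilarity).
input_vec = parse_sparse_vector(
    "{5:0.854718643737767,172:0.8292121703264371,"
    "43:0.6976395601234899,27:0.7327087555023397,2:0.8962379566075429}"
)
# The same indices copied from the clusteredPoints dump.
clustered_vec = parse_sparse_vector(
    "[2:0.896, 5:0.855, 27:0.733, 43:0.698, 172:0.829]"
)

print(vectors_match(input_vec, clustered_vec))  # prints True
```

The order of the entries differs between the two dumps, so the sketch compares 
by index and not by position.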

Build Test Failure Results :
Failed tests: 
  ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:102->assertVectorsWithOutlierRemoval:188->checkClustersWithOutlierRemoval:238->Assert.assertTrue:41->Assert.fail:88 not expecting cluster:0:{0:1.0,1:1.0}

Tests run: 761, Failures: 1, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Mahout Build Tools ................................ SUCCESS [2.347s]
[INFO] Apache Mahout ..................................... SUCCESS [0.519s]
[INFO] Mahout Math ....................................... SUCCESS [1:18.563s]
[INFO] Mahout Core ....................................... FAILURE [5:50.906s]
[INFO] Mahout Integration ................................ SKIPPED
[INFO] Mahout Examples ................................... SKIPPED
[INFO] Mahout Release Package ............................ SKIPPED
[INFO] Mahout Math/Scala wrappers ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7:12.848s
[INFO] Finished at: Sat Jan 25 08:45:56 PST 2014
[INFO] Final Memory: 28M/554M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on project mahout-core: There are test failures.
[

> clusteredPoints do not contain a vector id
> ------------------------------------------
>
>                 Key: MAHOUT-1410
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1410
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>         Environment: using 0.9 release candidate
>            Reporter: Pat Ferrel
>            Assignee: Suneel Marthi
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1410.patch, MAHOUT-1410.patch
>
>
> When clustering non-named vectors there are no vector ids in clusteredPoints 
> so the other values there, cluster id, vector values, distance-squared, pdf, 
> cannot be tied to any known vector.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
